Matching two audio files using FFT (Android Studio) - java

I've been working on a part of my app for the past few days where I need to simultaneously play and record an audio file. The task I need to accomplish is just to compare the recording to the audio file played and return a matching percentage. Here's what I have done so far and some context to my questions:
The target API is >15
I decided to use a .wav audio file format to simplify decoding the file
I'm using AudioRecord for recording and MediaPlayer for playing the audio file
I created a decoder class to take my audio file and convert it to PCM so I can perform the matching analysis
I'm using the following specs for the recording AudioFormat (CHANNEL_MONO, 16 BIT, SAMPLE_RATE = 44100)
After I pass the audio file to the decoder, I then proceed to pass it to an FFT class in order to get the frequency domain data needed for my analysis.
And below are a few questions that I have:
When I record audio using AudioRecord, is the format PCM by default, or do I need to specify this somehow?
I'm trying to pass the recording to the FFT class in order to acquire the frequency domain data to perform my matching analysis. Is there a way to do this without saving the recording on the user's device?
After performing the FFT analysis on both files, do I need to store the data in a text file in order to perform the matching analysis? What are some options or possible ways to do this?
After doing a fair amount of research, all the sources I found cover how to match a recording against songs/music contained in a database. My goal is to see how closely two specific audio files match; how would I go about this? Do I need to create/use hash functions to accomplish my goal? A detailed answer to this would be really helpful.
Currently I have a separate thread for recording, a separate activity for decoding the audio file, and a separate activity for the FFT analysis. I plan to run the matching analysis in a separate thread or an AsyncTask as well. Do you think this structure is optimal, or is there a better way to do it? Also, should I pass my audio file to the decoder in a separate thread too, or can I do it in the recording thread or the MatchingAnalysis thread?
Do I need to perform windowing in my operations on audio files before I can do matching comparison?
Do I need to decode the .wav file or can I just compare 2 .wav files directly instead?
Do I need to perform low-pitching operations on audio files before comparison?
In order to perform my matching comparison, what data exactly do I need to generate (power spectrum, energy spectrum, spectrogram etc)?
Am I going about this the right way or am I missing something?

In apps like Shazam and Midomi, audio matching is done using a technique called audio fingerprinting, which uses a spectrogram and hashing.
Your first step of computing the FFT is correct, but then you will need to build a 2D graph of time against frequency, called a spectrogram.
This spectrogram array contains more than a million samples, and we can't work with that much data. So we find peaks in amplitude. A peak is a (time, frequency) pair corresponding to an amplitude value that is the greatest in a local neighborhood around it. Peak finding is a computationally expensive process, and different apps or projects do it in different ways. We use peaks because they are less sensitive to background noise.
Different songs can have some of the same peaks, but the order and the time differences between their occurrences will differ. So we combine these peaks into unique hashes and save them in a database.
Perform the above process for each of the audio files you want your app to recognise, and match recordings against your database. Matching is not simple, though, and the time offset must be taken into account, because the recording can start at any instant in the song while we have a fingerprint of the full song. But this is not a problem, because the fingerprint contains relative time differences.
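The peak-and-hash idea above can be sketched roughly as follows. This is an illustrative toy, not Shazam's actual algorithm: the class name `PeakFinder`, the parameters `radius` and `floor`, and the bit layout of the hash are all my own choices for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class PeakFinder {

    /** A (timeIndex, frequencyBin) pair marking a local magnitude maximum. */
    public static class Peak {
        public final int time;
        public final int freq;
        public Peak(int time, int freq) { this.time = time; this.freq = freq; }
    }

    /**
     * Scans a spectrogram (indexed [time][frequency], holding magnitudes)
     * and keeps the points that are the strongest within a +/- radius
     * neighborhood and above a noise floor.
     */
    public static List<Peak> findPeaks(double[][] spec, int radius, double floor) {
        List<Peak> peaks = new ArrayList<>();
        for (int t = 0; t < spec.length; t++) {
            for (int f = 0; f < spec[t].length; f++) {
                double v = spec[t][f];
                if (v < floor) continue;
                boolean isPeak = true;
                for (int dt = -radius; dt <= radius && isPeak; dt++) {
                    for (int df = -radius; df <= radius; df++) {
                        int tt = t + dt, ff = f + df;
                        if (tt < 0 || tt >= spec.length || ff < 0 || ff >= spec[tt].length) continue;
                        if (spec[tt][ff] > v) { isPeak = false; break; }
                    }
                }
                if (isPeak) peaks.add(new Peak(t, f));
            }
        }
        return peaks;
    }

    /**
     * Packs an anchor/target peak pair into a single hash. The time delta
     * (b.time - a.time, assumed non-negative here) is part of the hash,
     * which is why a match works no matter where in the song the
     * recording starts.
     */
    public static long hash(Peak a, Peak b) {
        return ((long) a.freq << 40) | ((long) b.freq << 20) | (long) (b.time - a.time);
    }
}
```

A real fingerprinter would pair each anchor peak with several target peaks in a "target zone" ahead of it and store the hashes keyed by song ID, but the neighborhood test and the relative-time hash above are the core of the idea.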
It is a somewhat detailed process, and you can find more explanation in this link: http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf
There are some libraries that can do it for you: dejavu (https://github.com/worldveil/dejavu) and Chromaprint (in C++). musicg is in Java, but it doesn't perform well with background noise.
Matching two audio files is a complicated process, and like the comments above, I would also tell you to try it on a PC first, then on phones.

Related

Crossfading songs when streaming to Icecast2 in Java

Some months ago, I wrote my own stream source client in Java for streaming playlists to an Icecast2 server.
The logic is simple:
You have multiple "channels", and every channel has a playlist (in this case, a folder filled with mp3 files). After a channel has started, it begins streaming by picking the first song and streaming it via HTTP to the Icecast2 server. As you can imagine, after a song ends, the next one is picked.
Here is the code which I am currently using for sending audio to icecast:
https://gist.github.com/z3ttee/e40f89b80af16715efa427ace43ed0b4
What I would like to achieve is to implement a crossfade between two songs. So when a song ends, it should fade out and fade in the next one simultaneously.
I am relatively new to working with audio in Java. I know that I have to rework the way the audio is sent to Icecast, but there is the problem: I have no clue how or where to start.
If you have any idea where or how to start, feel free to share your experience.
Thank you in advance!
I think for cross-fading, you are likely going to have to use a library that works with the audio at the PCM level. If you wish to write your own mixer, the basic steps are as follows:
read the data via the input stream
using the audio format of the stream, convert the audio to PCM
as PCM, the audio values can be mixed by simple addition; so over the course of the cross-fade, ramp one side up from zero and the other down to zero
convert the audio back to the original format and stream that
A linear cross-fade, where the audio data is multiplied by steps that progress linearly from 0 to 1 or vice versa (e.g., 0.1, 0.2, 0.3, ...), will tend to leave the midpoint quieter than when running the beginning or ending track solo. A sine function is often used instead to keep the sum at a steady volume.
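The mixing steps above can be sketched like this. The class name `CrossFader` is hypothetical, and the snippet assumes both tracks have already been converted to normalized PCM floats (-1 to 1); an equal-power sine/cosine curve keeps the summed loudness steadier than a linear ramp.

```java
public class CrossFader {
    /**
     * Mixes the tail of an outgoing track with the head of an incoming
     * track over fadeFrames frames. The cosine ramp fades the outgoing
     * side from 1 to 0 while the sine ramp fades the incoming side from
     * 0 to 1, so the combined power stays roughly constant.
     */
    public static float[] crossfade(float[] outgoing, float[] incoming, int fadeFrames) {
        float[] mixed = new float[fadeFrames];
        for (int i = 0; i < fadeFrames; i++) {
            double p = (double) i / fadeFrames;            // progress, 0 -> 1
            float fadeOut = (float) Math.cos(p * Math.PI / 2);
            float fadeIn  = (float) Math.sin(p * Math.PI / 2);
            float sum = outgoing[i] * fadeOut + incoming[i] * fadeIn;
            // clamp in case the sum leaves the legal range
            mixed[i] = Math.max(-1f, Math.min(1f, sum));
        }
        return mixed;
    }
}
```

The mixed floats would then be converted back to the stream's byte format and written to the Icecast connection in place of the raw song data.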
There are two libraries I know of that might be helpful for mixing, but would likely require some modification. One is TinySound, the other is AudioCue (which I wrote). The modifications required for AudioCue might be relatively painless. The output of the mixer is enclosed in the class AudioMixerPlayer, a runnable that is located on line 268 of AudioMixer.java. A possible plan would be to modify the output line of this code, substituting your broadcast line for the SourceDataLine.
I should add, the songs to be played would first be loaded into the AudioCue class, which then exposes the capability of real-time volume control. But it might be necessary to tinker with the manner in which the volume commands are issued.
I'm really interested in having this work and could offer some assistance. I'm just now getting involved in projects with Socket and SocketServer, and would like to get some hands-on with streaming audio.

Scan for a sound on a specific process

I wonder if there is a way in Java to monitor the volume of a specific, predefined process. If that volume exceeds a certain dB value, a KeyEvent or MouseEvent should be sent to that process only, so this could also run in the background.
I haven't found a good way to implement this in Java so far, so I'm also wondering if and how this is possible.
Thanks in advance!
Basic plan:
Set up the sound to play over a SourceDataLine (this will not work for a Clip)
During playback, convert the byte data to PCM (the exact algorithm will depend on the audio format)
Apply an RMS filter algorithm to the PCM data (look up root mean square for more info)
If the RMS value exceeds a target value, send a notification (using a "loosely coupled" design pattern with minimal risk of blocking the audio playback)
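The RMS step above can be sketched in a few lines. This assumes the byte-to-float conversion has already happened; the class name `RmsMeter` is my own for the sketch.

```java
public class RmsMeter {
    /**
     * Root-mean-square level of a window of normalized PCM floats (-1..1).
     * For reference, a full-scale sine wave has an RMS of about 0.707.
     */
    public static double rms(float[] pcm) {
        double sumSquares = 0;
        for (float s : pcm) {
            sumSquares += s * s;
        }
        return Math.sqrt(sumSquares / pcm.length);
    }

    /** True if this window's level exceeds the trigger threshold. */
    public static boolean exceeds(float[] pcm, double threshold) {
        return rms(pcm) > threshold;
    }
}
```

In practice you would run this per buffer (or per smaller window) during playback and fire the notification from a separate, loosely coupled listener so the audio loop is never blocked.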
For reading data exposed by the SourceDataLine during playback, check the first example in the tutorial "Using Files and Format Converters", the point where the code reads the following:
// Here, do something useful with the audio data that's
// now in the audioBytes array...
As far as determining what RMS value corresponds to the desired trigger dB or loudness level, things can get quite complicated. Audio data is basically relative, not absolute, and the "perceived" loudness of a given RMS value depends on the frequency content of the sound. The post "How to get the volume level from PCM data" discusses some of the complications encountered in this realm. But perhaps that is overthinking the issue, depending on what you are going for.

How can I play an audio clip in a (MIDI) sequence in Java?

I am attempting to write a very simple DAW in Java but am having trouble playing an audio clip in a sequence. I have looked into both the sampled and MIDI classes in Java Sound but what I really need is a hybrid of the two.
It seems that with the MIDI classes you cannot, for example, use a Sequencer to play your own audio clip.
I have attempted to write my own sequencer that uses scheduling to play a javax.sound.sampled.Clip in a sequence, but the timings vary far too much. It is not really a viable option, as it doesn't keep time.
Does anybody have any suggestions of how I could get around this?
I can attest that an audio mixing system combining aspects of MIDI and samples can be written in Java, as I wrote my own and it currently works with samples and a couple real-time synths that I also wrote.
The key is making the audio data of the samples available on a per-frame basis, with a frame-counting command-processor/audio-mixer that both manages the execution of "commands" and collects and mixes the audio frame data. At 44100 fps, that's accuracy in the vicinity of 0.02 milliseconds. I can describe it in more detail if requested.
Another way to go, probably saner, though I haven't done it personally, would be to make use of a Java bridge to a system such as Jack.
EDIT: Answering questions in comment (12/8/19).
Audio sample data in Java is usually either held in memory (Java uses Clip) or read from a .wav file. Because Clip does not expose the individual frames, I wrote an alternative, and use it to hold the data as signed floats ranging from -1 to 1. Signed floats are a common way to hold audio data that is going to undergo multiple operations.
For playback of .wav audio, Java combines reading the data with an AudioInputStream and outputting with a SourceDataLine. Your system will have to sit in the middle, intercepting the AudioInputStream, converting to PCM float frames, and counting the frames as you go.
A number of sources or tracks can be processed at the same time, and merged (simple addition of the normalized floats) to a single signal. This signal can be converted back to bytes and sent out for playback via a single SourceDataLine.
Counting output frames from an arbitrary 0th frame from the single SourceDataLine will help with keeping constituent incoming tracks coordinated, and will provide the frame number reference used to schedule any additional commands that you wish to execute prior to that frame being output (e.g., changing a volume/pan of a source, or a setting on a synth).
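The frame-counting command processor described above could be sketched roughly like this. The name `FrameScheduler` and its API are hypothetical, not the answerer's actual code; the real mixer would call `advance()` once per buffer, just before writing that buffer to the SourceDataLine.

```java
import java.util.PriorityQueue;

public class FrameScheduler {

    /** A command to run when the mixer's frame counter reaches "frame". */
    private static class Command {
        final long frame;
        final Runnable action;
        Command(long frame, Runnable action) { this.frame = frame; this.action = action; }
    }

    private final PriorityQueue<Command> queue =
            new PriorityQueue<>((a, b) -> Long.compare(a.frame, b.frame));
    private long currentFrame = 0;

    /** Schedule an action (volume change, note-on, etc.) at an absolute frame. */
    public synchronized void schedule(long frame, Runnable action) {
        queue.add(new Command(frame, action));
    }

    /**
     * Called once per mixed buffer, before the buffer is written out: runs
     * every command that falls inside this buffer, then advances the frame
     * counter. At 44100 fps the granularity is well under a millisecond.
     */
    public synchronized void advance(int bufferFrames) {
        long end = currentFrame + bufferFrames;
        while (!queue.isEmpty() && queue.peek().frame < end) {
            queue.poll().action.run();
        }
        currentFrame = end;
    }
}
```

Because all constituent tracks are mixed into the one SourceDataLine whose frames this counter tracks, every scheduled command lands at a deterministic point in the output.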
My personal alternate to a Clip is very similar to AudioCue which you are welcome to inspect and use. The main difference is that for better or worse, I'm processing everything one frame at a time in my system, and AudioCue and its "Mixer" process buffer loads. I've had several very credible people criticize my personal per-frame system as inefficient, so when I made the public API for AudioCue, I bowed to that preconception. [There are ways to add buffering to a per-frame system to recapture that efficiency, and per-frame makes scheduling simpler. So I'm sticking with my per-frame logical scheme.]
No, you can't use a sequencer to play your own clips directly.
In the MIDI world, you have to deal with samples, instruments, and soundbanks.
Very quickly, a sample is the audio data + informations such as looping points, note range covered by the sample, base volume and envelopes, etc.
An instrument is a set of samples, and a soundbank contain a set of instruments.
If you want to use your own sounds to play some music, you must make a soundbank out of them.
You will also need to use an implementation other than the default provided by Java, because that default only reads soundbanks in a proprietary format which has been gone for at least 15, perhaps even 20, years.
Back in 2008-2009 there existed, for example, Gervill. It was able to read SF2 and DLS soundbanks. SF2 and DLS are two popular soundbank formats, and several programs exist on the market, free or paid, to edit them.
If you want to go the other way round, starting with the sampled API: as you have noticed, you can't rely on timers, task scheduling, Thread.sleep and the like to get enough precision.
The best precision you can achieve with those is around 10 ms, which is of course far too coarse to be acceptable for music.
The usual way to go here is to generate the audio of your music by mixing your audio clips yourself into the final output. That way you can achieve frame precision.
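Frame-precise mixing of clips can be sketched like this, assuming the clips have already been decoded to normalized PCM floats (the class name `OfflineMixer` is hypothetical):

```java
public class OfflineMixer {
    /**
     * Adds a clip into a master buffer starting at an exact frame offset.
     * Summing the normalized floats and clamping the result is the
     * simplest form of mixing; the start frame is computed from the
     * desired start time (frame = seconds * sampleRate), which is what
     * gives frame-accurate timing.
     */
    public static void mixAt(float[] master, float[] clip, int startFrame) {
        for (int i = 0; i < clip.length && startFrame + i < master.length; i++) {
            float sum = master[startFrame + i] + clip[i];
            master[startFrame + i] = Math.max(-1f, Math.min(1f, sum));
        }
    }
}
```

The finished master buffer is then converted back to bytes and played through a single SourceDataLine (or written to a file).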
In fact, that's very roughly what a MIDI synthesizer does.

Synchronized recording from a microphone array using the Java Sound API

I've gone through the tutorials for the Java Sound API and I've successfully read off data from my microphone.
I would now like to go a step further and get data synchronously from multiple microphones in a microphone array (like a PS3 Eye or ReSpeaker).
I could get a TargetDataLine for each microphone and open/start/write the input to buffers, but I don't know how to do this in a way that gives me data I can line up time-wise (I would like to eventually do beamforming).
When reading from something like ALSA, I would get the bytes from the different microphones simultaneously, so I know that each byte from each microphone is from the same time instant. But the Java Sound API seems to have an abstraction that obfuscates this, because you are just dumping/writing data out of separate line buffers and processing it, and each line acts separately. You don't interact with the whole device/mic array at once.
However, I've found someone who managed to do beamforming in Java with the Kinect 1.0, so I know it should be possible. The problem is that the secret sauce is inside a custom Mixer object inside a .jar that was pulled out of some other software, so I don't have any easy way to figure out how they pulled it off.
You will only be able to align data from multiple sources with the time-synchronous accuracy needed for beamforming if this is supported by the underlying hardware drivers.
If the underlying hardware provides you with multiple, synchronised, data-streams (e.g. recording in 2 channels - in stereo), then your array data will be time synchronised.
If you are relying on the OS to simply provide you with two independent streams, then maybe you can rely on timestamping. Do you get the timestamp of the first element? If so, you can re-align the data by dropping samples based on your sample rate. There may be a final difference (delta-t) that you will have to factor into your beamforming algorithm.
Reading about the PS3 Eye (which has an array of microphones), you will be able to do this if the audio driver provides all the channels at once.
For Java, this probably means: can you open the line with an AudioFormat that includes 4 channels? If yes, then your buffers will contain multichannel frames, and the decoded frame data will (almost certainly) be time-aligned.
To quote the Java docs : "A frame contains the data for all channels at a particular time".
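A sketch of what that could look like with the Java Sound API: construct a 4-channel AudioFormat and de-interleave each frame into per-channel arrays. The class name `MicArrayCapture` is my own, and whether a 4-channel line can actually be opened depends entirely on the hardware and driver.

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.TargetDataLine;

public class MicArrayCapture {

    /** 4 channels, 16-bit signed little-endian PCM at 44.1 kHz. */
    public static AudioFormat fourChannelFormat() {
        return new AudioFormat(44100f, 16, 4, true, false);
    }

    /** Opens and starts a capture line for the format, if the driver supports it. */
    public static TargetDataLine openLine(AudioFormat fmt) throws LineUnavailableException {
        TargetDataLine line = AudioSystem.getTargetDataLine(fmt);
        line.open(fmt);
        line.start();
        return line;
    }

    /**
     * Splits interleaved 16-bit little-endian frames into one float array
     * per channel. Each frame holds one sample for every channel at the
     * same time instant, which is what keeps the channels time-aligned.
     */
    public static float[][] deinterleave(byte[] buffer, int channels) {
        int frames = buffer.length / (2 * channels);
        float[][] perChannel = new float[channels][frames];
        for (int f = 0; f < frames; f++) {
            for (int c = 0; c < channels; c++) {
                int i = (f * channels + c) * 2;
                // low byte unsigned, high byte sign-extended: signed 16-bit
                int sample = (buffer[i] & 0xFF) | (buffer[i + 1] << 8);
                perChannel[c][f] = sample / 32768f;
            }
        }
        return perChannel;
    }
}
```

If `openLine` throws for the 4-channel format, the driver is only exposing independent mono/stereo streams, and you are back to the timestamp-realignment approach from the answer above.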
IDK what "beamforming" is, but if there is hardware that can provide synchronization, using that would obviously be the best solution.
Here, for what it is worth, is what should be a plausible algorithmic way to manage synchronization.
(1) Set up a frame counter for each TargetDataLine. You will have to convert bytes to PCM as part of this process.
(2) Set up some code to monitor the volume level on each line, some sort of RMS algorithm I would assume, on the PCM data.
(3) Create a loud, instantaneous burst that reaches each microphone at the same time, one that the RMS algorithm is able to detect and to give the frame count for the onset.
(4) Adjust the frame counters as needed, and reference them going forward on each line of incoming data.
Rationale: Java doesn't offer real-time guarantees, as explained in this article on real-time, low latency audio processing. But in my experience, the correspondence between the byte data and time (per the sample rate) is very accurate on lines closest to where Java interfaces with external audio services.
How long would frame counting remain accurate without drifting? I have never done any tests to research this. But on a practical level, I have coded a fully satisfactory "audio event" scheduler based on frame-counting, for playing multipart scores via real-time synthesis (all done with Java), and the timing is impeccable for the longest compositions attempted (6-7 minutes in length).

Audio Manager subdivided into more levels than what Android provides

Android provides a default of 15 steps for its sound systems, which you can access through AudioManager. However, I would like to have finer control.
One method of doing so seems to be altering specific files within the Android system to divide the sound levels even further than the default. I would like to achieve the same effect programmatically using Java.
Fine volume control would mean, for example, the app being able to divide the sound levels into one hundred distinct intervals. How do I achieve this?
One way, in Java, to get very precise volume adjustment is to access the PCM data directly and multiply it by some factor, usually from 0 up to 1. Another is to try to access the line's volume control, if it has one. I've given up trying to do the latter: the precision is okay in terms of amplitude, but the timing is terrible, since one can only make one volume change per audio buffer read.
To access the PCM data directly, one has to iterate through the audio read buffer, translate the bytes into PCM, perform the multiplication, and then translate back into bytes. But this gives you per-frame control, so very smooth and fast fades can be made.
EDIT: To do this in Java, first check out the sample code snippet at the start of this java tutorial link, in particular, the section with the comment
// Here, do something useful with the audio data that's now in the audioBytes array...
There are several StackOverflow questions that show code for the math to convert audio bytes to PCM and back, using Java. Should not be hard to uncover with a search.
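For illustration, here is one way the byte-to-PCM-and-back scaling might look for 16-bit little-endian PCM. The class name `VolumeScaler` is hypothetical, and other audio formats need different byte math.

```java
public class VolumeScaler {
    /**
     * Scales a buffer of 16-bit little-endian PCM in place: each pair of
     * bytes is decoded to a signed sample, multiplied by a 0..1 factor,
     * and re-encoded. Calling this with a gradually changing factor per
     * buffer (or per frame) produces smooth fades.
     */
    public static void scale(byte[] audioBytes, float volume) {
        for (int i = 0; i < audioBytes.length; i += 2) {
            // low byte unsigned, high byte sign-extended: signed 16-bit
            int sample = (audioBytes[i] & 0xFF) | (audioBytes[i + 1] << 8);
            sample = Math.round(sample * volume);
            // clamp to the legal 16-bit range
            sample = Math.max(-32768, Math.min(32767, sample));
            audioBytes[i] = (byte) (sample & 0xFF);
            audioBytes[i + 1] = (byte) (sample >> 8);
        }
    }
}
```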
Pretty late to the party, but I'm currently trying to solve this issue as well. If you are making your own media player app and are running an instance of MediaPlayer, then you can use the method setVolume(leftScalar, rightScalar), where leftScalar and rightScalar are floats in the range 0.0 to 1.0, representing logarithmic-scale volume for each respective ear.
However, this means that you must have a reference to the currently active MediaPlayer instance. If you are making a music app, no biggie. If you're trying to run a background service that gives users finer precision over all media output, I'm not sure how to use this in that scenario.
Hope this helps.
