Audio Classification and Regression using PyTorch
In recent times the deep learning bandwagon has been moving pretty fast. With all the different things you can do with it, it’s no surprise; images, tabular data, and all sorts of other media classification and generation algorithms have gotten quite a boost. But one form of media that doesn’t get much love is audio. Being a huge music fan myself, that’s a bit of a bummer.
Recently, while looking for resources to build an audio regressor, I realized that the amount of quality content for audio-based ML is negligible compared to that for other media formats. So I decided to write a little tutorial about it myself, something a bit different from the hackneyed ResNet-50 classification videos. 😏
Since COVID-19 is the hot topic right now, I decided to build an audio regressor that can predict how much cough is present in a recording. This is a regression model, but if you want to build a classification model instead, I will cover that at the end. This blog is for newbies as well as experienced people and will cover everything from data preprocessing to model creation. So let’s get started.
1. Initialization
In this part I will just go over the libraries we need to import as well as how to acquire the data. If you already have some other data that you want to use, feel free to skip this section.
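The original import cell isn’t embedded here, so below is a minimal sketch of what it might look like; the exact list will depend on your setup, but these are the libraries used throughout this tutorial.

```python
import os
import json

import torch
import torchaudio
from torch import nn
from torch.utils.data import Dataset, DataLoader

# Use a GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
```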
The above code imports all the necessary libraries and initializes your device to either the CPU or a GPU. Since the model won’t be that resource intensive, a GPU isn’t necessary, but it will be better if you have one. If you don’t have one, just use Google Colab.
As for the data, let’s use the COUGHVID dataset. You can download it from https://zenodo.org/record/4498364.
The public version of the dataset contains around 2000 audio files with samples from subjects representing a wide range of genders, ages, geographic locations, and COVID-19 statuses, labeled by experienced pulmonologists.
2. Data Preprocessing
In this step I will go over the entire data preprocessing and PyTorch dataset creation pipeline. But first of all, let’s take a look at the data.
The audio files are in the .webm format and the metadata is in the JSON format. One big problem here is that PyTorch gets really fussy if the audio data is not in the .wav format. And believe me, converting the file formats without corrupting them is a nightmare if it’s your first time doing it. But don’t worry, I have a script that does the job for you.
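The exact script isn’t embedded here, so here is a minimal sketch of one way to do the conversion from Python; the AUDIO_DIR folder name is an assumption, and you could just as well run ffmpeg in a shell loop.

```python
import os
import subprocess

AUDIO_DIR = "coughvid_dataset"  # assumed folder containing the .webm files

for file_name in os.listdir(AUDIO_DIR):
    if file_name.endswith(".webm"):
        src = os.path.join(AUDIO_DIR, file_name)
        dst = os.path.join(AUDIO_DIR, file_name.replace(".webm", ".wav"))
        # -y overwrites existing files; ffmpeg infers the output format
        # from the .wav extension
        subprocess.run(["ffmpeg", "-y", "-i", src, dst], check=True)
```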
This script uses ffmpeg, a command line tool for converting audio and video formats (seriously, it is very useful for modifying datasets), to convert all the .webm files to .wav files. Note that this will take quite some time depending on your hardware, and also that .wav files are 8–10 times the size of .webm files, so make sure you have the storage space for that. You can delete the .webm files afterwards, but make sure not to modify the JSON files.
Now let’s actually get to viewing the data (I know we have taken a long detour 😅).
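A sketch of how the file names can be collected (the variable and folder names follow the assumptions above):

```python
# Collect the base names of every recording that has both a .wav audio file
# and a .json metadata file; this list lets us look files up by index later
name_set = sorted(
    f[:-4] for f in os.listdir(AUDIO_DIR)
    if f.endswith(".wav") and os.path.exists(os.path.join(AUDIO_DIR, f[:-4] + ".json"))
)
print(len(name_set))
```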
The name_set just stores the names of all the files so that we can easily refer to them by index in the PyTorch dataset; I will cover this later. Now let’s view the content, which contains data about our subject.
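A sketch of what that looks like, using one file from name_set:

```python
# Peek at one subject's metadata and load the corresponding audio file
sample_name = name_set[0]

with open(os.path.join(AUDIO_DIR, sample_name + ".json")) as f:
    content = json.load(f)
print(content)  # age, gender, status, cough_detected, ...

waveform, sample_rate = torchaudio.load(os.path.join(AUDIO_DIR, sample_name + ".wav"))
print(waveform.shape, sample_rate)  # tensor of shape [channels, time] and the sample rate
```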
As you can see above, we have a lot of different information about the subject, like their age, geographical location, and the amount of cough detected. The ‘cough_detected’ parameter here is our label; this is what we want our ML model to predict.
Now let’s look at the last line of the code block. Here we use the load method, which returns the waveform, a tensor of shape [channels, time], along with the sample rate. The waveform is basically a 2D array of amplitude values over time, sampled at ‘sample rate’ values per second, that we feed to our ML model.
If all these terms seem abstruse to you, try reading this blog, which gives you some insight into sample rate, bit rate, etc.: https://www.vocitec.com/docs-tools/blog/sampling-rates-sample-depths-and-bit-rates-basic-audio-concepts
Now let’s build the PyTorch dataset. This can be a little overwhelming if it is your first time working with audio, so just refer to the torchaudio documentation if you don’t understand something.
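The original gist isn’t embedded here, so what follows is a sketch of the dataset class along the lines described in the rest of this section; the class name, constructor arguments, and the cough_detected label lookup are assumptions, and the helper methods are explained in detail further down.

```python
import os
import json

import torch
import torchaudio
from torch.utils.data import Dataset


class CoughDataset(Dataset):
    def __init__(self, audio_dir, names, transformation,
                 target_sample_rate, num_samples, device):
        self.audio_dir = audio_dir
        self.names = names                               # list of file base names
        self.device = device
        self.transformation = transformation.to(device)  # e.g. a MelSpectrogram
        self.target_sample_rate = target_sample_rate
        self.num_samples = num_samples

    def __len__(self):
        return len(self.names)

    def __getitem__(self, index):
        # Use the index to get the paths of the audio file and its label
        name = self.names[index]
        audio_path = os.path.join(self.audio_dir, name + ".wav")
        with open(os.path.join(self.audio_dir, name + ".json")) as f:
            label = float(json.load(f)["cough_detected"])

        signal, sr = torchaudio.load(audio_path)
        signal = signal.to(self.device)
        # Make every sample uniform before transforming it
        signal = self._resample(signal, sr)
        signal = self._mix_down(signal)
        signal = self._cut(signal)
        signal = self._right_pad(signal)
        # Convert the waveform into a Mel spectrogram
        signal = self.transformation(signal)
        return signal, torch.tensor(label, dtype=torch.float32)

    def _resample(self, signal, sr):
        # Resample the signal to the target sample rate if necessary
        if sr != self.target_sample_rate:
            resampler = torchaudio.transforms.Resample(sr, self.target_sample_rate).to(self.device)
            signal = resampler(signal)
        return signal

    def _mix_down(self, signal):
        # Aggregate all channels into one ("mono") by averaging them
        if signal.shape[0] > 1:
            signal = torch.mean(signal, dim=0, keepdim=True)
        return signal

    def _cut(self, signal):
        # Truncate signals that have more than num_samples samples
        if signal.shape[1] > self.num_samples:
            signal = signal[:, :self.num_samples]
        return signal

    def _right_pad(self, signal):
        # Right-pad signals that have fewer than num_samples samples
        if signal.shape[1] < self.num_samples:
            missing = self.num_samples - signal.shape[1]
            signal = torch.nn.functional.pad(signal, (0, missing))
        return signal
```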
OK, so let’s get started. As you can see in the code above, the __init__, __len__, and __getitem__ methods are fairly self-explanatory (we use the index to get the path of the audio file and the label). If you want to learn more about them, check out the PyTorch documentation.
In the __getitem__ method we apply some transformations to the waveform to make it suitable for our model. The most important of these is the MelSpectrogram transformation from torchaudio. It basically converts the waveform from the Hz scale to the Mel scale. The Mel scale is constructed so that sounds an equal distance apart on it also “sound” equally far apart to humans; it aims to mimic the non-linear human perception of sound. For a better understanding, take a look at the code.
Here sample_rate is the desired sample rate, and n_fft is the size of the fast Fourier transform window, i.e. the number of samples in each short segment that gets transformed. The hop_length is the gap between the start of one segment and the start of the next, so consecutive segments overlap only partially instead of completely. It also directly determines the width of our input. The n_mels parameter is the number of Mel banks or filters in the spectrogram, basically the number of partitions the Hz scale has been divided into.
One way to relate the above parameters is that the frame rate of the spectrogram = sample_rate / hop_length, so one second of audio gives roughly sample_rate / hop_length time frames. (You will notice this below when I discuss the shape of our layers.)
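If you want to see this relationship for yourself, here is a quick sanity check; the parameter values (22050 Hz sample rate, n_fft=1024, hop_length=512, n_mels=64) are the assumptions I use throughout this sketch.

```python
import torch
import torchaudio

mel_spectrogram = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050,
    n_fft=1024,
    hop_length=512,
    n_mels=64,
)

dummy_waveform = torch.randn(1, 22050)  # 1 channel, 1 second of audio
spec = mel_spectrogram(dummy_waveform)
print(spec.shape)  # torch.Size([1, 64, 44]): (channels, n_mels, ~22050/512 frames)
```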
I know this can be a little convoluted, so take a look at this wonderful blog explaining the Mel spectrogram in detail.
Also check out this Stack Overflow post about the code.
But before we apply the transformation to our waveform, let’s make sure that all of our data has the same sample rate, number of channels, and duration. For this I have created a bunch of private methods, which I will go over in detail below.
_resample: This method resamples our audio signal so that it has the desired target sample rate. This matters because the sample rate directly affects the bounds of our Hz scale and in turn our Mel scale.
_mix_down: This method makes sure our audio signal has only one channel (it turns it into a “mono” signal). It does so by aggregating all the channels into one.
_cut: This method truncates the audio signal if it is longer than a certain duration, i.e. if it has more than a certain number of samples.
_right_pad: This method is the opposite of _cut, i.e. it right-pads the audio signal if it has fewer than the desired number of samples.
After applying all of these preprocessing steps, we finally transform our audio signal into a Mel spectrogram and return it along with our label.
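Putting it all together, instantiating the dataset might look like this; the one-second clip length and the transform parameters are the same assumptions as in the sanity check above.

```python
SAMPLE_RATE = 22050
NUM_SAMPLES = 22050  # keep exactly one second of audio per clip (assumed)

mel_spectrogram = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=512, n_mels=64
)

dataset = CoughDataset(AUDIO_DIR, name_set, mel_spectrogram,
                       SAMPLE_RATE, NUM_SAMPLES, device)

spectrogram, label = dataset[0]
print(spectrogram.shape, label)  # torch.Size([1, 64, 44]) and the cough_detected value
```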
3. ML Modelling
Phew 😅 the last section was a bit overwhelming, so let’s get into some model building. I am not going to use a complex model architecture, since the goal of this blog is just to cover the basics, so for this task we will use a simple CNN model. I know you are wondering how a CNN, which is used mainly for computer vision, can be used for an audio task. Aren’t audio and images different?
Yes, they are. But after preprocessing, our audio signal/waveform can be interpreted as a 2D image (frequency × time). Surprisingly, CNNs turn out to be quite effective on audio data. For a better understanding, read this blog; the model I will show you is also heavily inspired by it.
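Since the original gist isn’t embedded here, below is a sketch of a CNN of the kind described next; it assumes a (1, 64, 44) spectrogram input, and the exact filter counts and linear-layer sizes are my assumptions.

```python
from torch import nn


class CNNNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        # Four convolution blocks: Conv2d -> ReLU -> MaxPool2d (halves the size)
        self.conv1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.conv4 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.flatten = nn.Flatten()
        # 128 feature maps of size 5 x 4 remain for a (1, 64, 44) input
        self.dense = nn.Sequential(
            nn.Linear(128 * 5 * 4, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )
        self.sigmoid = nn.Sigmoid()  # our regression target is bounded in [0, 1]

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.flatten(x)
        x = self.dense(x)
        return self.sigmoid(x)


model = CNNNetwork().to(device)
```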
The above is the code for our model. It has 4 convolution blocks, each consisting of a Conv2d layer with a ReLU activation followed by a max pooling layer that halves the size of the image (spectrogram). The convolution blocks are followed by a simple flatten layer, a couple of linear/dense layers, and finally the output layer, which in our case is a sigmoid layer since our outputs are bounded between 0 and 1. If this were a classification task, you would just replace the sigmoid layer with a softmax or log-softmax layer.
That’s it, our model is done. This is a pretty standard architecture, and there are many NN architectures that will give you a better result. Just recently I came across an architecture that fed the convolution block outputs to a GRU and also incorporated residual learning. If you are interested in improving your accuracy, give a model like that a try.
I have also used torchsummary (a very useful package for generating model summaries) to generate our model summary.
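Generating the summary is a one-liner; the (1, 64, 44) input size is again my assumed spectrogram shape.

```python
from torchsummary import summary

# Print a layer-by-layer summary for a (1, 64, 44) spectrogram input
summary(model, input_size=(1, 64, 44), device=device)
```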
As you can see next to the Conv2d-1 layer, the output shape is (-1, 16, 66, 46). 16 is the number of filters, and subtracting the 2 extra rows and columns added by the padding gives (64, 44), which is nothing but (n_mels, sample_rate/hop_length) as discussed above.
4. Training and Prediction
Moving on to the final section, let’s just define some basic training utility functions that will make our lives easier, since PyTorch doesn’t have a built-in fit method like TensorFlow. You could also use PyTorch Lightning for this.
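A sketch of what those utilities might look like (the function names follow the ones mentioned below):

```python
def train_single_epoch(model, data_loader, loss_fn, optimizer, device):
    model.train()
    for inputs, targets in data_loader:
        inputs = inputs.to(device)
        targets = targets.to(device).unsqueeze(1)  # (batch, 1) to match the model output

        predictions = model(inputs)
        loss = loss_fn(predictions, targets)

        # Standard backpropagation step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"loss: {loss.item():.4f}")


def train(model, data_loader, loss_fn, optimizer, device, epochs):
    for epoch in range(epochs):
        print(f"Epoch {epoch + 1}")
        train_single_epoch(model, data_loader, loss_fn, optimizer, device)
        print("---------------------------")
    print("Finished training")
```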
Here the train function just trains the model for a number of epochs using the train_single_epoch function. The constants and the rest of the training code are below. There is nothing more to explain in this part; if you have any difficulties, check out the PyTorch docs.
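The hyperparameter values here are assumptions, so tune them to your liking; since the target is a value between 0 and 1, a mean squared error loss is a natural choice.

```python
BATCH_SIZE = 128
EPOCHS = 10
LEARNING_RATE = 0.001

train_loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

loss_fn = nn.MSELoss()  # regression target in [0, 1]
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

train(model, train_loader, loss_fn, optimizer, device, EPOCHS)

# Save the trained weights (file name is arbitrary)
torch.save(model.state_dict(), "cough_regressor.pth")
```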
And here is the code for running predictions.
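A minimal sketch of a predict helper, evaluated on the first item of the dataset:

```python
def predict(model, spectrogram):
    model.eval()
    with torch.no_grad():
        # Add a batch dimension: (1, n_mels, time) -> (1, 1, n_mels, time)
        prediction = model(spectrogram.unsqueeze(0).to(device))
    return prediction.item()


spectrogram, label = dataset[0]
print(f"Predicted: {predict(model, spectrogram):.3f}, actual: {label.item():.3f}")
```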
That’s it, I hope this was useful to you. 😃
For the code, check out my GitHub repository.
This blog was heavily inspired by this YouTube playlist, so if you want some additional resources check it out.
Burhanuddin Rangwala, the author of this blog, is an aspiring Data Scientist and Machine Learning Engineer who also works as a freelance web developer and writer. Hit him up on LinkedIn for any work or freelance opportunities.