Audio Classification

Introduction

Most voice assistants follow the same basic steps:

  1. First, the assistant must convert the speech to text.

  2. The text is run through a natural language processing (NLP) step, which turns the words into numeric data.

  3. Finally, the utterance (what the person said) is classified into an intent (what they want the voice assistant to do).

We will build a simple model that can understand yes and no. We will use the open Speech Commands dataset, which is available through TorchAudio's built-in datasets. The dataset contains 36 different words/sounds (including background noise) that can be used for classification. Each utterance is stored as a one-second (or shorter) WAVE file. We will only use yes and no for binary classification.

Audio Data

Just like with images, we need to convert the physical world into numbers, a digital representation that a computer can understand.

For audio, a microphone captures the sound, which is then converted from an analog signal to a digital one by sampling it at consistent intervals of time. The number of samples taken per second is called the sample rate. The higher the sample rate, the higher the quality of the sound, but past a certain point the human ear cannot detect the difference. A common sample rate is 48 kHz, or 48,000 samples per second. This dataset was sampled at 16 kHz, so our sample rate is 16,000.

When the audio is sampled, we capture both the frequency (the pitch of the sound) and the amplitude (how loud it is). From the sample rate and frequency we can represent the signal visually as a waveform, which is the signal plotted over time in graphical form. Audio can also be recorded in multiple channels; for example, stereo recordings have two channels, right and left.

How might we want to parse a file? For longer audio files you may want to split the audio into frames, i.e. sections of the audio that are classified individually. For this dataset we don't need to frame our samples, since each one is only a second long and contains a single word. Another processing option is an offset, which is the number of frames from the start of the file at which to begin loading data.
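
Both ideas can be sketched in a few lines, assuming filename points at one of the dataset's one-second WAV files (filename is defined in the Download Data section below): the duration is just the number of samples divided by the sample rate, and frame_offset/num_frames load only a slice of the file.

import torchaudio

waveform, sample_rate = torchaudio.load(filepath=filename)   # full clip
duration_seconds = waveform.shape[1] / sample_rate           # e.g. 16000 / 16000 = 1.0

# load 4000 samples (0.25 s), skipping the first 2000 samples (0.125 s)
clip, _ = torchaudio.load(filepath=filename, frame_offset=2000, num_frames=4000)
print(duration_seconds, clip.shape)                          # clip.shape is (1, 4000)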

Get set up with TorchAudio

TorchAudio is a library in the PyTorch ecosystem that provides I/O functionality, popular open datasets, and the common audio transformations we will need to build our model. We will use this library to work with our audio data.
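
TorchAudio is typically installed alongside PyTorch (for example, pip install torch torchaudio). A minimal sanity check of the pieces this walkthrough relies on, assuming a standard install:

import torch
import torchaudio

print(torch.__version__, torchaudio.__version__)

# The three parts of TorchAudio used below:
#   I/O:        torchaudio.load / torchaudio.save
#   Datasets:   torchaudio.datasets.SPEECHCOMMANDS
#   Transforms: torchaudio.transforms (Spectrogram, MelSpectrogram, MFCC, Resample, ...)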

Download Data

import os
import torchaudio

folder = 'data'
filename = "data/SpeechCommands/speech_commands_v0.02/yes/00f0204f_nohash_0.wav"
default_dir = os.getcwd()
print(f'Data directory: {default_dir}/{folder}')
if not os.path.exists(folder):
    print("create directory")
    os.mkdir(folder)
    speech_commands = torchaudio.datasets.SPEECHCOMMANDS(folder, download=True)
else:
    print("Directory exists")
    speech_commands = torchaudio.datasets.SPEECHCOMMANDS(folder, download=False)

Show Classes of dataset

def visualize_classes():
    # each sub-folder of the dataset directory is a class label
    os.chdir(f"{folder}/SpeechCommands/speech_commands_v0.02/")
    labels = [name for name in os.listdir() if os.path.isdir(name)]
    os.chdir(default_dir)
    print(labels)

visualize_classes()


['right', 'eight', 'cat', 'tree', 'backward', 'learn', 'bed', 'happy', 'go', 'dog', 'no', 'wow', 'follow', 'nine', 'left', 'stop', 'three', '_background_noise_', 'sheila', 'one', 'bird', 'zero', 'seven', 'up', 'visual', 'marvin', 'two', 'house', 'down', 'six', 'yes', 'on', 'five', 'forward', 'off', 'four']

Convert the sound to tensor

A WAVE file is one format in which we save the digital representation of our analog audio so it can be shared and played.

The Speech Commands dataset we are using is stored as WAVE files, each one second or shorter. We load a file with torchaudio.load, which loads an audio file into a torch.Tensor object. TorchAudio abstracts away the loading function for the different audio backends.

# load only the first three samples
waveform, sample_rate = torchaudio.load(filepath=filename, num_frames=3)
print(f'waveform tensor:{waveform}')
# load three samples, starting two samples into the file
waveform, sample_rate = torchaudio.load(filepath=filename, num_frames=3, frame_offset=2)
print(waveform)
# load the whole file
waveform, sample_rate = torchaudio.load(filepath=filename)
print(waveform)

waveform tensor:tensor([[0.0005, 0.0007, 0.0005]])
tensor([[0.0005, 0.0004, 0.0007]])
tensor([[0.0005, 0.0007, 0.0005, ..., 0.0008, 0.0008, 0.0007]])

Plot the waveform

import matplotlib.pyplot as plt


plt.figure()
plt.plot(waveform.t().numpy())
plt.show()


Data visualization and transformation

TorchAudio has many transforms available in the library. Let's take a deeper look at the following concepts and transforms: Spectrogram, MelSpectrogram, Waveform, and MFCC. Once we understand these concepts we will create spectrogram images of the yes/no dataset to be used in the computer vision model.

The list of supported transformations (a small sketch of chaining a few of them follows the list):

  • Resample: Resample waveform to a different sample rate.
  • Spectrogram: Create a spectrogram from a waveform.
  • GriffinLim: Compute waveform from a linear scale magnitude spectrogram using the Griffin-Lim transformation.
  • ComputeDeltas: Compute delta coefficients of a tensor, usually a spectrogram.
  • ComplexNorm: Compute the norm of a complex tensor.
  • MelScale: This turns a normal STFT into a Mel-frequency STFT, using a conversion matrix.
  • AmplitudeToDB: This turns a spectrogram from the power/amplitude scale to the decibel scale.
  • MFCC: Create the Mel-frequency cepstrum coefficients from a waveform.
  • MelSpectrogram: Create MEL Spectrograms from a waveform using the STFT function in PyTorch.
  • MuLawEncoding: Encode waveform based on mu-law companding.
  • MuLawDecoding: Decode mu-law encoded waveform.
  • TimeStretch: Stretch a spectrogram in time without modifying pitch for a given rate.
  • FrequencyMasking: Apply masking to a spectrogram in the frequency domain.
  • TimeMasking: Apply masking to a spectrogram in the time domain.
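
As a small illustration (not part of the original walkthrough), these transforms are implemented as nn.Module subclasses, so several of them can be chained with torch.nn.Sequential. The parameter values below are only examples, and filename comes from the Download Data section above:

import torch
import torchaudio

pipeline = torch.nn.Sequential(
    torchaudio.transforms.Resample(orig_freq=16000, new_freq=8000),    # 16 kHz -> 8 kHz
    torchaudio.transforms.MelSpectrogram(sample_rate=8000, n_mels=64),
    torchaudio.transforms.AmplitudeToDB(),                             # power -> decibels
)

waveform, sample_rate = torchaudio.load(filepath=filename)
mel_db = pipeline(waveform)
print(mel_db.shape)  # (channels, n_mels, time_frames)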

Load the Dataset folders into a DataLoader

from pathlib import Path

data_dir = "data/SpeechCommands/speech_commands_v0.02"


def get_data(path=os.path.join(data_dir, "yes/00f0204f_nohash_0.wav")):
    print(f'Data Name: {os.path.split(path)[1]}')
    if not os.path.exists(folder):
        print("create directory")
        os.mkdir(folder)
        torchaudio.datasets.SPEECHCOMMANDS(folder, download=True)

    waveform, sample_rate = torchaudio.load(filepath=path)
    print(f"waveform shape: {waveform.shape} sample rate: {sample_rate}\n")
    return waveform, sample_rate


def load_audio_data(label):
    path = os.path.join(data_dir, label)
    dataset = []
    walker = sorted(str(p) for p in Path(path).glob('*.wav'))
    for i, file_path in enumerate(walker):
        data = dict()
        file_name = os.path.split(file_path)[1]

        # file name without extension, e.g. 3d794813_nohash_0
        speaker = os.path.splitext(file_name)[0]
        speaker_id, utterance_number = speaker.split('_nohash_')
        utterance_number = int(utterance_number)

        # load the waveform and collect metadata for this sample
        waveform, sample_rate = get_data(file_path)
        data['waveform'] = waveform
        data['sample_rate'] = sample_rate
        data['label'] = label
        data['speaker_id'] = speaker_id
        data['utterance_number'] = utterance_number
        dataset.append(data)
    return dataset

Load the dataset into a DataLoader

from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self, label):
        super(MyDataset, self).__init__()
        self.data = load_audio_data(label)

    def __getitem__(self, item):
        return self.data[item]

    def __len__(self):
        return len(self.data)


def dataloader(labels, batch_size):
    dataloaders = dict()
    for label in labels:
        dataset = MyDataset(label)
        # collate_fn=lambda i: i keeps each batch as a plain list of sample dicts
        dataloaders[label] = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=True,
                                        collate_fn=lambda i: i, num_workers=0)
    return dataloaders


if __name__ == "__main__":
    dataloaders = dataloader(['yes', 'no'], 2)
    for i, data in enumerate(dataloaders['yes']):
        if i >= 2:
            break
        print(data)

-------------------------------------------------------------------------
[{'waveform': tensor([[ 0.0005, -0.0010, -0.0008, ..., -0.0004, 0.0002, 0.0016]]), 'sample_rate': 16000, 'label': 'yes', 'speaker_id': '483e2a6f', 'utterance_number': 1}, {'waveform': tensor([[-0.0012, -0.0047, -0.0022, ..., -0.0013, -0.0014, -0.0022]]), 'sample_rate': 16000, 'label': 'yes', 'speaker_id': 'a60a09cf', 'utterance_number': 0}]
[{'waveform': tensor([[-6.1035e-05, -1.5259e-04, -2.4414e-04, ..., -2.4414e-04,
-2.7466e-04, -2.4414e-04]]), 'sample_rate': 16000, 'label': 'yes', 'speaker_id': '87070229', 'utterance_number': 4}, {'waveform': tensor([[-0.0013, -0.0020, -0.0030, ..., -0.0025, -0.0027, -0.0019]]), 'sample_rate': 16000, 'label': 'yes', 'speaker_id': 'e1469561', 'utterance_number': 0}]

Transform and visualize

Let's break down some of the audio transforms and visualizations to better understand what they are and what they tell us about the data.

Waveform

The waveform is produced from the sample rate and frequency and represents the signal visually. The signal can be shown as a waveform, which is the signal plotted over time in graphical form. Audio can be recorded in different channels; for example, a stereo recording has two channels, right and left.

Here we use the Resample transform to reduce the size of the waveform, and then plot the new waveform shape.

Plot the waveform before and after resampling for comparison:

def plot_waveform(waveform, sample_rate, new_sample_rate, label):
    channel = 0
    waveform_transformed = torchaudio.transforms.Resample(sample_rate, new_sample_rate)(
        waveform[channel, :].view(1, -1))
    plt.figure(figsize=(13, 5))
    plt.subplot(1, 2, 1)
    plt.title(f'sample rate: {sample_rate}')
    plt.plot(waveform.t().numpy())
    plt.subplot(1, 2, 2)
    plt.title(f'new sample rate: {new_sample_rate}')
    plt.plot(waveform_transformed.t().numpy())
    plt.suptitle(f'label: {label}')
    plt.show()


if __name__ == "__main__":
    waveform, sample_rate = get_data()
    plot_waveform(waveform, sample_rate, sample_rate / 100, 'yes')


Spectrogram

A spectrogram maps the frequencies of an audio file over time and lets us visualize the audio data by frequency. This image is what we will use for the computer vision classification of our audio files.

def plot_spectrogram(waveform):
    spectrogram = torchaudio.transforms.Spectrogram()(waveform)
    print("Shape of spectrogram: {}".format(spectrogram.size()))

    plt.figure()
    plt.imshow(spectrogram.log2()[0, :, :].numpy(), cmap='gray')
    # plt.imsave(f'test/spectrogram_img.png', spectrogram.log2()[0,:,:].numpy(), cmap='gray')
    plt.show()


Mel Spectrogram

A mel spectrogram is also frequency over time, but the frequencies are converted to the mel scale. The mel scale takes the frequencies and rescales them based on how pitch is perceived by the listener. The transform converts the frequencies to the mel scale internally and then creates the spectrogram image.

def show_melspectrogram(waveform, sample_rate):
    mel_spectrogram = torchaudio.transforms.MelSpectrogram(sample_rate)(waveform)
    print("Shape of spectrogram: {}".format(mel_spectrogram.size()))

    plt.figure()
    plt.imshow(mel_spectrogram.log2()[0, :, :].numpy(), cmap='gray')
    plt.show()


Mel-frequency cepstral coefficients (MFCC)

A simplified explanation of MFCC is that it takes the frequencies, applies transformations to them, and the result is the amplitudes of the spectrum produced from those frequencies.

def show_mfcc(waveform, sample_rate):
    mfcc_spectrogram = torchaudio.transforms.MFCC(sample_rate=sample_rate)(waveform)
    print("Shape of spectrogram: {}".format(mfcc_spectrogram.size()))

    plt.figure()
    plt.gcf()
    plt.imshow(mfcc_spectrogram.log2()[0, :, :].numpy(), cmap='gray')

    plt.figure()
    plt.plot(mfcc_spectrogram.log2()[0, :, :].numpy())
    plt.draw()


Create an image from a Spectrogram

We have broken down some of the ways to understand our audio data and the different transformations we can apply to it. Now let's create the images we will use for classification. Below are two different functions: one creates spectrogram images and the other creates MFCC images for classification. In this example we will use the spectrogram images, but feel free to experiment with MFCC classification using the MFCC image function below.

def create_images(training_dataloader, label_dir):
    # make directory
    directory = f'data/spectrograms/{label_dir}/'
    if os.path.isdir(directory):
        print("Data exists")
    else:
        os.makedirs(directory, mode=0o777, exist_ok=True)

    for batch in training_dataloader:
        for data in batch:
            waveform = data['waveform']
            id = data['speaker_id']
            utterance_number = data['utterance_number']

            # create transformed waveforms
            spectrogram_tensor = torchaudio.transforms.Spectrogram()(waveform)

            plt.figure()
            plt.imsave(f'data/spectrograms/{label_dir}/spec_img_{id}_{utterance_number}.png',
                       spectrogram_tensor.log2()[0, :, :].numpy(), cmap='gray')


def create_mfcc_images(training_dataloader, label_dir):
    # make directory
    directory = f'data/MfccSpectrograms/{label_dir}'
    if os.path.isdir(directory):
        print("Data exists")
    else:
        os.makedirs(directory, mode=0o777, exist_ok=True)

    for batch in training_dataloader:
        for data in batch:
            waveform = data['waveform']
            id = data['speaker_id']
            sample_rate = data['sample_rate']
            utterance_number = data['utterance_number']

            # create transformed waveforms
            mfcc_spectrogram = torchaudio.transforms.MFCC(sample_rate=sample_rate)(waveform)
            plt.figure()
            fg1 = plt.gcf()
            plt.imshow(mfcc_spectrogram.log2()[0, :, :].numpy(), cmap='gray')
            plt.draw()
            fg1.savefig(os.path.join(directory, f'spec_img_{id}_{utterance_number}.png'), dpi=100)



Build Speech Model

Now that we have created the spectrogram images, it's time to build the computer vision model. We will use the torchvision package to build our vision model.

Load Spectrogram images into a DataLoader for training

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, models, transforms

data_path = 'data/spectrograms'

# hyperparameters (assumed values; the original defines these elsewhere)
batch_size = 15
num_workers = 0
learning_rate = 0.0001
epoch = 10

yes_and_no_data = datasets.ImageFolder(
    root=data_path,
    transform=transforms.Compose([
        transforms.Resize((201, 81)),
        transforms.ToTensor()
    ])
)

# split data to test and train
# use 80% to train
train_size = int(0.8 * len(yes_and_no_data))
training_data, test_data = random_split(yes_and_no_data, [train_size, len(yes_and_no_data) - train_size])

print(f'train_size: {len(training_data)} test_size: {len(test_data)}')

training_dataloader = DataLoader(dataset=training_data,
                                 batch_size=batch_size,
                                 num_workers=num_workers,
                                 shuffle=True)

test_dataloader = DataLoader(dataset=test_data,
                             batch_size=batch_size,
                             num_workers=num_workers,
                             shuffle=True)

Create NN

We use a CNN to classify the audio spectrogram images.

class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.cnn = nn.Sequential(
            # size (3, 201, 81)
            nn.Conv2d(in_channels=3, out_channels=32, kernel_size=(5, 5)),
            # size (32, 197, 77)
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=(5, 5)),
            # size (64, 193, 73)
            nn.MaxPool2d(kernel_size=2),
            # size (64, 96, 36)
            nn.Dropout2d(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features=64 * 96 * 36, out_features=50),
            nn.Linear(in_features=50, out_features=2)
        )

    def forward(self, x):
        x = self.cnn(x)
        return self.classifier(x)



==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
CNN                                      --                        --
├─Sequential: 1-1                        [1, 64, 96, 36]           --
│    └─Conv2d: 2-1                       [1, 32, 197, 77]          2,432
│    └─Conv2d: 2-2                       [1, 64, 193, 73]          51,264
│    └─MaxPool2d: 2-3                    [1, 64, 96, 36]           --
│    └─Dropout2d: 2-4                    [1, 64, 96, 36]           --
├─Sequential: 1-2                        [1, 2]                    --
│    └─Flatten: 2-5                      [1, 221184]               --
│    └─Linear: 2-6                       [1, 50]                   11,059,250
│    └─Linear: 2-7                       [1, 2]                    102
==========================================================================================
Total params: 11,113,048
Trainable params: 11,113,048
Non-trainable params: 0
Total mult-adds (M): 770.21
==========================================================================================
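
As a quick sanity check on the in_features=64 * 96 * 36 value in the classifier, here is the shape arithmetic behind the summary above: each unpadded 5x5 convolution trims 4 pixels from each spatial dimension, and the 2x2 max pool halves them (with floor division).

h, w = 201, 81          # spectrogram image size after Resize((201, 81))
h, w = h - 4, w - 4     # Conv2d 5x5, no padding -> (197, 77)
h, w = h - 4, w - 4     # Conv2d 5x5, no padding -> (193, 73)
h, w = h // 2, w // 2   # MaxPool2d(2) -> (96, 36)
print(64 * h * w)       # 221184, the in_features of the first Linear layer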

Create Train and Test functions

def train(net, training_dataloader, valid_dataloader, print_step=100, optimizer=None, loss_fn=nn.CrossEntropyLoss()):
    hist = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': []}
    optimizer = optim.Adam(net.parameters(), lr=learning_rate) if optimizer is None else optimizer
    for i in range(1, epoch + 1):
        print(f"epoch: {i}\n------------------------------------------")
        # training loop
        net.train()
        size, acc, total_loss, batch = len(training_dataloader.dataset), 0, 0, 0
        for batch, (x, y) in enumerate(training_dataloader):
            pred_y = net(x)
            loss = loss_fn(pred_y, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            acc += (pred_y.argmax(1) == y).type(torch.float).sum().item()

            if batch % print_step == 0:
                print(f"train loss: {loss:>5f} [{batch * batch_size}/{size}]")
        print(f"train_loss: {total_loss / batch:>5f} train_acc: {acc / size:>5f} {size}")
        hist['train_loss'].append(total_loss / batch)
        hist['train_acc'].append(acc / size)

        # validation loop
        net.eval()
        size, acc, total_loss, count = len(valid_dataloader.dataset), 0, 0, 0
        with torch.no_grad():
            for x, y in valid_dataloader:
                pred_y = net(x)
                total_loss += loss_fn(pred_y, y).item()
                acc += (pred_y.argmax(dim=1) == y).type(torch.float).sum().item()
                count += 1
        print(f"val_loss: {total_loss / count:>5f} val_acc: {acc / size:>5f} {size}\n")
        hist['val_loss'].append(total_loss / count)
        hist['val_acc'].append(acc / size)
    return hist

Plot the average accuracy and Loss

def plot_acc_loss(hist):
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(hist['train_acc'], label='Training acc')
    plt.plot(hist['val_acc'], label='Validation acc')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(hist['train_loss'], label='Training loss')
    plt.plot(hist['val_loss'], label='Validation loss')
    plt.legend()
    plt.show()

Train and Validation

from torchinfo import summary  # provides the model summary shown above

if __name__ == "__main__":
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'device is {device}')
    # load_train_data() wraps the ImageFolder/DataLoader setup shown earlier and
    # returns the training and validation dataloaders
    training_dataloader, val_dataloader = load_train_data()
    model = CNN()
    print(summary(model, input_size=(1, 3, 201, 81)))
    hist = train(model, training_dataloader, val_dataloader, print_step=1000)
    plot_acc_loss(hist)


epoch: 10
------------------------------------------
train loss: 0.303747 [0/6388]
train loss: 0.202990 [3200/6388]
train_loss: 0.250494 train_acc: 0.908109 6388
val_loss: 0.242247 val_acc: 0.909831 1597
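
As a possible next step (not shown in the original walkthrough), the trained model can classify a single saved spectrogram image. The helper below is a hypothetical sketch: the image path is illustrative, and index 0 maps to 'no' and 1 to 'yes' because ImageFolder sorts the class folders alphabetically.

from PIL import Image
import torch
from torchvision import transforms

def predict(model, image_path):
    # apply the same preprocessing used when building the training DataLoader
    tf = transforms.Compose([
        transforms.Resize((201, 81)),
        transforms.ToTensor(),
    ])
    img = tf(Image.open(image_path).convert('RGB')).unsqueeze(0)  # shape (1, 3, 201, 81)
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(img), dim=1)
    return ['no', 'yes'][probs.argmax(dim=1).item()], probs

# Example (hypothetical path):
# label, probs = predict(model, 'data/spectrograms/yes/spec_img_00f0204f_0.png')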
