Computer vision with PyTorch (Dense and Convolutional layers)

Task

Use image classification tasks to learn about convolutional neural networks, and then see how pre-trained networks and transfer learning can improve our models and solve real-world problems.

Introduction to image data

In computer vision, we normally solve one of the following problems:

  • Image Classification is the simplest task, where we need to classify an image into one of many pre-defined categories, for example, distinguishing a cat from a dog in a photograph, or recognizing a handwritten digit.
  • Object Detection is a slightly harder task, in which we need to find known objects in the picture and localize them, i.e. return the bounding box for each recognized object.
  • Segmentation is similar to object detection, but instead of giving a bounding box we need to return an exact pixel map outlining each recognized object.


Multi-dimensional arrays are also called tensors. Using tensors to represent images has an additional advantage: we can use an extra dimension to store a sequence of images. For example, to represent a video fragment consisting of 200 frames of 800×600 resolution, we may use a tensor of size 200×3×600×800.
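As a quick illustration, here is a minimal sketch (not from the original code) of how such tensors look in PyTorch:

import torch

image = torch.zeros(3, 600, 800)        # one RGB image: channels x height x width
video = torch.zeros(200, 3, 600, 800)   # 200 frames of the same size
print(video.shape)                      # torch.Size([200, 3, 600, 800])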

Import packages and load the MNIST Dataset

We are using the well-known MNIST dataset of handwritten digits, available through torchvision.datasets.MNIST in PyTorch. The dataset object returns the data in the form of Python Imaging Library (PIL) images, which we convert to tensors by passing a transform=ToTensor() parameter.

import torch
from torchvision import datasets
from torchvision.transforms import ToTensor

training_data = datasets.MNIST(
    root='data',
    train=True,
    download=False,  # set download=True on the first run to fetch the data
    transform=ToTensor()
)

test_data = datasets.MNIST(
    root='data',
    train=False,
    download=False,
    transform=ToTensor()
)


If you are planning to load your own images, it is important to make sure that all values are scaled to the range 0–1 before training a neural network.
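For example, here is a minimal sketch of such scaling, assuming the raw image comes as 8-bit integers in the range 0–255:

import numpy as np
import torch

raw = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)  # stand-in for a loaded image
img = torch.tensor(raw, dtype=torch.float32) / 255.0            # values now in [0, 1]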

Training a dense neural network

Handwritten digit recognition is a classification problem. We will start with the simplest possible approach to image classification: a fully-connected neural network (which is also called a perceptron).

Fully-connected dense neural networks

A basic neural network in PyTorch consists of a number of layers. The simplest network would include just one fully-connected layer (called a Linear layer in PyTorch), with 784 inputs (one input for each pixel of the input image) and 10 outputs (one output for each class).


The dimension of our digit images is 1×28×28. Because the input dimension of a fully-connected layer is 784, we need to insert another layer into the network, called Flatten, to change the tensor shape from 1×28×28 to 784.

We want the n-th output of the network to return the probability of the input digit being equal to n. Because the output of a fully-connected layer is not normalized to be between 0 and 1, it cannot be interpreted as a probability. To turn it into a probability we need to apply another layer called Softmax.

In PyTorch, it is easier to use the LogSoftmax function, which computes logarithms of the output probabilities. To turn the output vector into actual probabilities, we need to take torch.exp of the output.
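A small sketch of this relationship (the tensor values here are made up for illustration):

import torch

logits = torch.tensor([[2.0, 1.0, 0.1]])
log_probs = torch.nn.functional.log_softmax(logits, dim=1)
probs = torch.exp(log_probs)   # actual probabilities; each row sums to 1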


from torch import nn

net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 10),    # 784 inputs, 10 outputs
    nn.LogSoftmax(dim=1))  # dim=1 normalizes over classes, not over the batch

Training the network

The training process steps are as follows:

  1. We take a minibatch from the input dataset, which consists of input data (features) and expected results (labels).
  2. We calculate the predicted result for this minibatch.
  3. The difference between this result and the expected result is calculated using a special function called the loss function.
  4. We calculate the gradients of this loss function with respect to the model weights (parameters), which are then used to adjust the weights to optimize the performance of the network. The amount of adjustment is controlled by a parameter called learning rate, and the details of the optimization algorithm are defined in the optimizer object.
  5. We repeat those steps until the whole dataset is processed. One complete pass through the dataset is called an epoch.
def train_epoch(self, epochs, lr=1e-2, optimizer=None):
    optimizer = optimizer if optimizer is not None else torch.optim.Adam(self.model.parameters(), lr=lr)
    res = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': []}
    for i in range(1, epochs + 1):
        print(f"epoch: {i}\n---------------------------------")
        # training pass over the whole dataset
        self.model.train()
        total_loss, acc, count = 0, 0, len(self.training_dataloader.dataset)
        for batch, (X, y) in enumerate(self.training_dataloader):
            pred_y = self.model(X)
            loss = self.loss_fn(pred_y, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            acc += (pred_y.argmax(1) == y).type(torch.float).sum().item()
            if batch % 100 == 0:
                print(f"loss: {loss:>5f} [{batch * self.batch_size}/{count}]")
        res['train_loss'].append(total_loss / count * self.batch_size)
        res['train_acc'].append(acc / count)
        print(f"train_loss: {total_loss / count * self.batch_size:>5f} train_acc: {acc / count:>5f}")

        # validation pass, with gradient tracking disabled
        self.model.eval()
        total_loss, acc, count = 0, 0, len(self.test_dataloader.dataset)
        with torch.no_grad():
            for X, y in self.test_dataloader:
                pred_y = self.model(X)
                loss = self.loss_fn(pred_y, y)
                total_loss += loss.item()
                acc += (pred_y.argmax(1) == y).type(torch.float).sum().item()
        res['val_loss'].append(total_loss / count * self.batch_size)
        res['val_acc'].append(acc / count)
        print(f"val_loss: {total_loss / count * self.batch_size:>5f} val_acc: {acc / count:>5f}\n")

    return res


The training loop above does the following for each epoch:

  • Switch the network to training mode (net.train())
  • Go over all batches in the dataset, and for each batch:
    • compute the predictions made by the network on this batch (out)
    • compute the loss, which is the discrepancy between predicted and expected values
    • try to minimize the loss by adjusting the weights of the network (optimizer.step())
    • compute the number of correctly predicted cases (accuracy)

We can visualize the training history to better understand how our model trains:

import matplotlib.pyplot as plt

plt.figure(figsize=(13,5))
plt.subplot(1,2,1)
plt.plot(hist['train_acc'], label='Training acc')
plt.plot(hist['val_acc'], label='Validation acc')
plt.legend()
plt.subplot(1,2,2)
plt.plot(hist['train_loss'], label='Training loss')
plt.plot(hist['val_loss'], label='Validation loss')
plt.legend()
plt.show()


The diagram on the left shows the training accuracy increasing (which corresponds to the network learning to classify our training data better and better), while the validation accuracy starts to fall. The diagram on the right shows the training loss and validation loss: you can see the training loss decreasing (meaning the model is performing better) and the validation loss increasing (meaning it is performing worse). These graphs indicate that the model is overfitting.

Visualizing network weights

Classification with a single linear layer amounts to multiplying the flattened input image by a weight matrix, which allows us to visualize the network weights with a bit of added logic:

def visualize_weights(self):
    # each row of the weight matrix corresponds to one output class
    weight_tensor = next(self.model.parameters())
    fig, ax = plt.subplots(1, 10, figsize=(20, 4))
    for i, x in enumerate(weight_tensor):
        ax[i].imshow(x.view(28, 28).detach())
    plt.show()


Training a multi-layered perceptron

In a multi-layer network, we will add one or more hidden layers.

[Figure: dense multi-layer network]

The number of parameters of a neural network should be chosen depending on the dataset size, to prevent overfitting.

Between the linear layers we insert a non-linear activation function layer, such as ReLU: if a network consisted of just a series of linear layers, it would essentially be equivalent to one linear layer.

[Figure: multi-layer network layers]

relu_fn = torch.relu
sigmoid_fn = torch.sigmoid

plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
plt.title("ReLU")
plt.plot(range(-10,10), [relu_fn(torch.tensor(x, dtype=torch.float)).item() for x in range(-10, 10)])
plt.subplot(1,2,2)
plt.title("Sigmoid")
plt.plot(range(-10,10), [sigmoid_fn(torch.tensor(x, dtype=torch.float)).item() for x in range(-10, 10)])
plt.show()

Network Definition

network layer structure:

[Figure: multi-layer network layers]

from torch import nn
from torchinfo import summary  # summary() assumed to come from the torchinfo package

net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(in_features=28 * 28, out_features=100),
    nn.ReLU(),
    nn.Linear(in_features=100, out_features=10),
    nn.LogSoftmax(dim=1)  # normalize over the class dimension
)
summary(net, input_size=(1, 28, 28))

==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
Model -- --
├─Sequential: 1-1 [1, 10] --
│ └─Flatten: 2-1 [1, 784] --
│ └─Linear: 2-2 [1, 100] 78,500
│ └─ReLU: 2-3 [1, 100] --
│ └─Linear: 2-4 [1, 10] 1,010
│ └─LogSoftmax: 2-5 [1, 10] --
==========================================================================================
Total params: 79,510
Trainable params: 79,510
Non-trainable params: 0
Total mult-adds (M): 0.08
==========================================================================================
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.32
Estimated Total Size (MB): 0.32
==========================================================================================


  • This network is more expressive than the one-layer perceptron we trained in the previous unit, so it achieves a much higher training accuracy; given a sufficiently large number of parameters, it can reach almost 100%.
  • Once the validation accuracy stops increasing, the model has reached its ability to generalize, and further training is likely to result in overfitting.

Class-based network definitions

Defining models using a Sequential style as a list of layers seems very convenient, but it is somewhat limited. At some point you may need to define more complex networks, which contain shared weights or non-linear connections between layers.

class ClassBasedNet(nn.Module):
    def __init__(self):
        super(ClassBasedNet, self).__init__()
        self.flatten = nn.Flatten()
        self.hidden = nn.Linear(in_features=28 * 28, out_features=100)
        self.out = nn.Linear(in_features=100, out_features=10)
        self.relu = nn.ReLU()
        self.log_softmax = nn.LogSoftmax(dim=1)  # normalize over classes

    def forward(self, x):
        x = self.flatten(x)
        x = self.hidden(x)
        x = self.relu(x)
        x = self.out(x)
        x = self.log_softmax(x)
        return x



==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
ClassBasedNet -- --
├─Flatten: 1-1 [1, 784] --
├─Linear: 1-2 [1, 100] 78,500
├─ReLU: 1-3 [1, 100] --
├─Linear: 1-4 [1, 10] 1,010
├─LogSoftmax: 1-5 [1, 10] --
==========================================================================================
Total params: 79,510
Trainable params: 79,510
Non-trainable params: 0
Total mult-adds (M): 0.08
==========================================================================================
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.32
Estimated Total Size (MB): 0.32
==========================================================================================

A custom neural network is represented by a class inheriting from the torch.nn.Module class. The class definition consists of two parts:

  • In the constructor (__init__), we define all the layers that our network will have. Those layers are stored as internal variables of the class, and PyTorch optimizes their parameters automatically. Internally, PyTorch uses the parameters() method to look for all trainable parameters, and nn.Module automatically collects all trainable parameters from all sub-modules.
  • The forward method performs the forward-pass computation of the neural network. In our case, we start with a parameter tensor x and explicitly pass it through all the layers and activation functions, starting from flatten, up to the final linear layer out. When we apply our neural network to some input data x by writing out = net(x), the forward method is called (see the sketch below).
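A usage sketch, assuming the ClassBasedNet defined above:

net = ClassBasedNet()
x = torch.randn(1, 1, 28, 28)   # a dummy batch of one 28x28 grayscale image
out = net(x)                    # invokes forward() under the hood
print(out.shape)                # torch.Size([1, 10])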

Convolutional neural networks

We will learn about Convolutional Neural Networks (CNNs), which are specifically designed for computer vision. Computer vision is different from generic classification, because when we are trying to find a certain object in the picture, we are scanning the image looking for some specific patterns and their combinations. For example, when looking for a cat, we may first look for horizontal lines, which can form whiskers, and then a certain combination of whiskers can tell us that it is actually a picture of a cat. The relative position and presence of certain patterns is important, not their exact position in the image.

Convolutional filters

Convolutional filters are small windows that run over each pixel of the image and compute a weighted average of the neighboring pixels. They are defined by matrices of weight coefficients. Let's see examples of applying two different convolutional filters over our MNIST handwritten digits:

def plot_convolution(training_data, kernel, title=""):
    with torch.no_grad():
        c = nn.Conv2d(kernel_size=kernel.size(), out_channels=1, in_channels=1)
        c.weight.copy_(kernel)  # use the hand-crafted kernel as the filter weights
        fig, ax = plt.subplots(2, 6, figsize=(8, 3))
        fig.suptitle(title, fontsize=16)
        for i in range(5):
            img = training_data[i][0]                    # original image: [1, 28, 28]
            ax[0][i].imshow(img[0])
            ax[0][i].axis("off")
            ax[1][i].imshow(c(img.unsqueeze(0))[0][0])   # convolved image: [26, 26]
            ax[1][i].axis("off")
        ax[0, 5].imshow(kernel)
        ax[0, 5].axis('off')
        ax[1, 5].axis('off')
        plt.show()

if __name__ == "__main__":
    load_mnist()
    vertical_edge_filter = torch.tensor([[-1., 0., 1.], [-1., 0., 1.], [-1., 0., 1.]])
    horizontal_edge_filter = torch.tensor([[-1., -1., -1.], [0., 0., 0.], [1., 1., 1.]])
    plot_convolution(training_data, vertical_edge_filter, "Vertical edge filter")
    plot_convolution(training_data, horizontal_edge_filter, "Horizontal edge filter")



The vertical edge filter:

$$\begin{pmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{pmatrix}$$

Convolutional layers

Convolutional layers are defined using the nn.Conv2d construction:

  • in_channels - the number of input channels. In our case we are dealing with a grayscale image, so the number of input channels is 1.
  • out_channels - the number of filters to use. We will use 9 different filters, which will give the network plenty of opportunities to explore which filters work best for our scenario.
  • kernel_size - the size of the sliding window. Usually 3×3 or 5×5 filters are used (see the sketch below).
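A minimal sketch of such a layer with the parameters above, verifying the output shape:

conv = nn.Conv2d(in_channels=1, out_channels=9, kernel_size=(5, 5))
out = conv(torch.randn(1, 1, 28, 28))   # a dummy batch of one image
print(out.shape)                        # torch.Size([1, 9, 24, 24]); 24 = 28 - 5 + 1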

The simplest CNN contains one convolutional layer. Given the input size 28×28, after applying nine 5×5 filters we end up with a tensor of 9×24×24 (there are only 24 positions where a sliding window of length 5 can fit into 28 pixels). We then flatten the 9×24×24 tensor into one vector of size 5184 and add a linear layer to produce 10 classes, with a ReLU activation function between the layers.

class SimplestConv(nn.Module):
    def __init__(self):
        super(SimplestConv, self).__init__()
        self.conv = nn.Conv2d(kernel_size=(5, 5), in_channels=1, out_channels=9)
        self.flatten = nn.Flatten()
        self.linear = nn.Linear(in_features=9 * 24 * 24, out_features=10)
        self.relu = nn.ReLU()
        self.log_softmax = nn.LogSoftmax(dim=1)  # normalize over classes

    def forward(self, x):
        x = self.conv(x)
        x = self.relu(x)
        x = self.flatten(x)
        x = self.linear(x)
        x = self.log_softmax(x)
        return x

==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
SimplestConv -- --
├─Conv2d: 1-1 [1, 9, 24, 24] 234
├─ReLU: 1-2 [1, 9, 24, 24] --
├─Flatten: 1-3 [1, 5184] --
├─Linear: 1-4 [1, 10] 51,850
├─LogSoftmax: 1-5 [1, 10] --
==========================================================================================
Total params: 52,084
Trainable params: 52,084
Non-trainable params: 0
Total mult-adds (M): 0.19
==========================================================================================
Input size (MB): 0.00
Forward/backward pass size (MB): 0.04
Params size (MB): 0.21
Estimated Total Size (MB): 0.25
==========================================================================================

epoch: 1
---------------------------------
loss: 4.846857 [0/60000]
loss: 2.922879 [12800/60000]
loss: 2.829096 [25600/60000]
loss: 2.816400 [38400/60000]
loss: 2.833445 [51200/60000]
train_loss: 2.916213 train_acc: 0.952633
val_loss: 2.824616 val_acc: 0.972700

epoch: 2
---------------------------------
loss: 2.841105 [0/60000]
loss: 2.804442 [12800/60000]
loss: 2.739106 [25600/60000]
loss: 2.762817 [38400/60000]
loss: 2.776352 [51200/60000]
train_loss: 2.796383 train_acc: 0.976933
val_loss: 2.808884 val_acc: 0.975700

epoch: 3
---------------------------------
loss: 2.801105 [0/60000]
loss: 2.779864 [12800/60000]
loss: 2.730634 [25600/60000]
loss: 2.752127 [38400/60000]
loss: 2.773396 [51200/60000]
train_loss: 2.775327 train_acc: 0.979883
val_loss: 2.806651 val_acc: 0.974600

epoch: 4
---------------------------------
loss: 2.797673 [0/60000]
loss: 2.769010 [12800/60000]
loss: 2.745473 [25600/60000]
loss: 2.741996 [38400/60000]
loss: 2.784167 [51200/60000]
train_loss: 2.764126 train_acc: 0.981950
val_loss: 2.808089 val_acc: 0.973000

epoch: 5
---------------------------------
loss: 2.784624 [0/60000]
loss: 2.738271 [12800/60000]
loss: 2.741519 [25600/60000]
loss: 2.739913 [38400/60000]
loss: 2.778515 [51200/60000]
train_loss: 2.755174 train_acc: 0.982967
val_loss: 2.819919 val_acc: 0.971700


We can visualize the weights of our trained convolutional layer, to try to make more sense of what is going on:

def visualize_weights(self):
    weight_tensor = next(self.model.parameters())   # [9, 1, 5, 5] filter weights
    fig, ax = plt.subplots(1, 9, figsize=(10, 3))
    with torch.no_grad():
        for i, x in enumerate(weight_tensor):
            ax[i].imshow(x.detach().cpu().squeeze(dim=0))
            ax[i].axis("off")
    plt.show()


Multi-layered CNNs and pooling layers

One of the most useful techniques in CNNs is reducing the spatial size of the image: we "scale down" the image using one of the pooling layers:

  • Average Pooling takes a sliding window (for example, 2×2 pixels) and computes the average of the values within the window.
  • Max Pooling replaces the window with its maximum value. The idea behind max pooling is to detect the presence of a certain pattern within the sliding window (both are shown in the sketch below).
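A small sketch comparing the two on a hand-made 4×4 feature map:

x = torch.tensor([[[[ 1.,  2.,  3.,  4.],
                    [ 5.,  6.,  7.,  8.],
                    [ 9., 10., 11., 12.],
                    [13., 14., 15., 16.]]]])
print(nn.AvgPool2d(kernel_size=2)(x))   # each output is the mean of a 2x2 block
print(nn.MaxPool2d(kernel_size=2)(x))   # each output is the max of a 2x2 block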

In a typical CNN there would be several convolutional layers, with pooling layers in between to decrease the dimensions of the image. We would also increase the number of filters, because as patterns become more advanced, there are more possible interesting combinations that we need to look for.

[Figure: CNN pyramid]

class MultiLayerCNN(nn.Module):
    def __init__(self):
        super(MultiLayerCNN, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=(5, 5))
        self.maxPool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(in_channels=10, out_channels=20, kernel_size=(5, 5))
        self.linear = nn.Linear(in_features=20 * 4 * 4, out_features=10)
        self.logSoftmax = nn.LogSoftmax(dim=1)
        self.relu = nn.ReLU()
        self.flatten = nn.Flatten()

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.maxPool(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = self.maxPool(x)

        x = self.flatten(x)
        x = self.linear(x)
        return self.logSoftmax(x)

==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
MultiLayerCNN -- --
├─Conv2d: 1-1 [1, 10, 24, 24] 260
├─ReLU: 1-2 [1, 10, 24, 24] --
├─MaxPool2d: 1-3 [1, 10, 12, 12] --
├─Conv2d: 1-4 [1, 20, 8, 8] 5,020
├─ReLU: 1-5 [1, 20, 8, 8] --
├─MaxPool2d: 1-6 [1, 20, 4, 4] --
├─Flatten: 1-7 [1, 320] --
├─Linear: 1-8 [1, 10] 3,210
├─LogSoftmax: 1-9 [1, 10] --
==========================================================================================
Total params: 8,490
Trainable params: 8,490
Non-trainable params: 0
Total mult-adds (M): 0.47
==========================================================================================


Playing with real images from the CIFAR-10 dataset

import numpy as np

def plot_dataset(dataset):
    cols, rows = 2, 8
    plt.figure(figsize=(10, 3))
    for i in range(1, rows * cols + 1):
        id = torch.randint(len(dataset), size=(1,)).item()
        x, y = dataset[id]
        # rescale pixel values to [0, 1] for display
        mn, mx = x.min(), x.max()
        x = np.transpose((x - mn) / (mx - mn), (1, 2, 0))
        plt.subplot(cols, rows, i)
        plt.axis("off")
        plt.title(dataset.classes[y])
        plt.imshow(x)
    plt.show()


A well-known architecture for CIFAR-10 is called LeNet, proposed by Yann LeCun. It follows the same principles as outlined above; the main difference is 3 input color channels instead of 1.

We also make one more simplification to this model: we do not use log_softmax as the output activation function, and just return the output of the last fully-connected layer. In this case we can simply use the CrossEntropyLoss function to optimize the model.
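A minimal sketch of why this works: CrossEntropyLoss expects raw logits and applies log_softmax internally (the values here are made up):

loss_fn = nn.CrossEntropyLoss()
logits = torch.randn(4, 10)           # a batch of 4 predictions over 10 classes
labels = torch.tensor([3, 7, 0, 1])   # expected class indices
loss = loss_fn(logits, labels)        # equivalent to log_softmax + NLLLoss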

class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=(5, 5))
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=(5, 5))
        self.conv3 = nn.Conv2d(in_channels=16, out_channels=120, kernel_size=(5, 5))

        self.maxPool = nn.MaxPool2d(kernel_size=2)
        self.linear1 = nn.Linear(in_features=120, out_features=64)
        self.linear2 = nn.Linear(in_features=64, out_features=10)

        self.relu = nn.ReLU()
        self.flatten = nn.Flatten()

    def forward(self, x):
        x = self.maxPool(self.relu(self.conv1(x)))
        x = self.maxPool(self.relu(self.conv2(x)))
        x = self.relu(self.conv3(x))

        x = self.flatten(x)
        x = self.linear1(x)
        x = self.relu(x)
        return self.linear2(x)

if __name__ == "__main__":
    DataSet.load_cifar_10()
    model = LeNet()
    summary(model, input_size=(1, 3, 32, 32))

==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
LeNet -- --
├─Conv2d: 1-1 [1, 6, 28, 28] 456
├─ReLU: 1-2 [1, 6, 28, 28] --
├─MaxPool2d: 1-3 [1, 6, 14, 14] --
├─Conv2d: 1-4 [1, 16, 10, 10] 2,416
├─ReLU: 1-5 [1, 16, 10, 10] --
├─MaxPool2d: 1-6 [1, 16, 5, 5] --
├─Conv2d: 1-7 [1, 120, 1, 1] 48,120
├─ReLU: 1-8 [1, 120, 1, 1] --
├─Flatten: 1-9 [1, 120] --
├─Linear: 1-10 [1, 64] 7,744
├─ReLU: 1-11 [1, 64] --
├─Linear: 1-12 [1, 10] 650
==========================================================================================


Pre-trained models and transfer learning

Training CNNs can take a lot of time, and a lot of data is required for that task. However, much of the time is spent learning the best low-level filters that the network uses to extract patterns from images.

Transfer learning transfers some knowledge from one neural network model to another. In transfer learning, we typically start with a pre-trained model that has been trained on some large image dataset, such as ImageNet. Those models can already do a good job extracting different features from generic images, and in many cases just building a classifier on top of those extracted features can yield a good result.

Playing with Cats vs. Dogs Dataset

We will be solving a real-life problem of classifying images of cats and dogs, using the Kaggle Cats vs. Dogs Dataset, which can also be downloaded from Microsoft.

import glob
import os
import zipfile

import wget
from PIL import Image

data_url = "https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip"
data_path = 'data/kagglecatsanddogs_3367a.zip'
data_dir = 'data/PetImages'
frame_name = 'kagglecatsanddogs_3367a.zip'
root = 'data'


def download():
    if not os.path.exists(data_path):
        wget.download(data_url, data_path)
    if not os.path.exists(data_dir):
        with zipfile.ZipFile(data_path, 'r') as zip_ref:
            zip_ref.extractall(root)
    # the dataset contains a few corrupt files that would break training
    check_image_dir(data_dir + '/Cat/*.jpg')
    check_image_dir(data_dir + '/Dog/*.jpg')


def check_image(fn):
    try:
        im = Image.open(fn)
        im.verify()
        return True
    except Exception:
        return False


def check_image_dir(path):
    for fn in glob.glob(path):
        if not check_image(fn):
            print(f"Corrupt image: {fn}")
            os.remove(fn)

Next, we load the images into a PyTorch dataset, converting them to tensors and doing some normalization. We apply a std_normalize transform to bring the images to the range expected by the pre-trained VGG network:

std_normalize = torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                 std=[0.229, 0.224, 0.225])
transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize(256),
    torchvision.transforms.CenterCrop(224),
    torchvision.transforms.ToTensor(),
    std_normalize
])
dataset = torchvision.datasets.ImageFolder(data_dir, transform=transform)
training_data, test_data = torch.utils.data.random_split(dataset, [20000, len(dataset) - 20000])
training_dataloader = torch.utils.data.DataLoader(training_data, batch_size)
test_dataloader = torch.utils.data.DataLoader(test_data, batch_size)


def plot_dataset(dataset, classes):
    plt.figure(figsize=(10, 3))
    cols, rows = 2, 8
    for i in range(1, cols * rows + 1):
        id = torch.randint(len(dataset), size=(1,)).item()
        x, y = dataset[id]
        # rescale normalized pixel values back to [0, 1] for display
        mn, mx = x.min(), x.max()
        x = np.transpose((x - mn) / (mx - mn), (1, 2, 0))
        plt.subplot(cols, rows, i)
        plt.title(classes[y])
        plt.axis('off')
        plt.imshow(x)
    plt.show()

plot_dataset(dataset, dataset.classes)


Pre-trained models

There are many different pre-trained models available inside the torchvision module, and even more models can be found on the Internet. Let's see how the simplest VGG-16 model can be loaded and used:

vgg = torchvision.models.vgg16(pretrained=True)


==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
Vgg16 -- --
├─VGG: 1-1 [1, 1000] --
│ └─Sequential: 2-1 [1, 512, 7, 7] --
│ │ └─Conv2d: 3-1 [1, 64, 224, 224] 1,792
│ │ └─ReLU: 3-2 [1, 64, 224, 224] --
│ │ └─Conv2d: 3-3 [1, 64, 224, 224] 36,928
│ │ └─ReLU: 3-4 [1, 64, 224, 224] --
│ │ └─MaxPool2d: 3-5 [1, 64, 112, 112] --
│ │ └─Conv2d: 3-6 [1, 128, 112, 112] 73,856
│ │ └─ReLU: 3-7 [1, 128, 112, 112] --
│ │ └─Conv2d: 3-8 [1, 128, 112, 112] 147,584
│ │ └─ReLU: 3-9 [1, 128, 112, 112] --
│ │ └─MaxPool2d: 3-10 [1, 128, 56, 56] --
│ │ └─Conv2d: 3-11 [1, 256, 56, 56] 295,168
│ │ └─ReLU: 3-12 [1, 256, 56, 56] --
│ │ └─Conv2d: 3-13 [1, 256, 56, 56] 590,080
│ │ └─ReLU: 3-14 [1, 256, 56, 56] --
│ │ └─Conv2d: 3-15 [1, 256, 56, 56] 590,080
│ │ └─ReLU: 3-16 [1, 256, 56, 56] --
│ │ └─MaxPool2d: 3-17 [1, 256, 28, 28] --
│ │ └─Conv2d: 3-18 [1, 512, 28, 28] 1,180,160
│ │ └─ReLU: 3-19 [1, 512, 28, 28] --
│ │ └─Conv2d: 3-20 [1, 512, 28, 28] 2,359,808
│ │ └─ReLU: 3-21 [1, 512, 28, 28] --
│ │ └─Conv2d: 3-22 [1, 512, 28, 28] 2,359,808
│ │ └─ReLU: 3-23 [1, 512, 28, 28] --
│ │ └─MaxPool2d: 3-24 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-25 [1, 512, 14, 14] 2,359,808
│ │ └─ReLU: 3-26 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-27 [1, 512, 14, 14] 2,359,808
│ │ └─ReLU: 3-28 [1, 512, 14, 14] --
│ │ └─Conv2d: 3-29 [1, 512, 14, 14] 2,359,808
│ │ └─ReLU: 3-30 [1, 512, 14, 14] --
│ │ └─MaxPool2d: 3-31 [1, 512, 7, 7] --
│ └─AdaptiveAvgPool2d: 2-2 [1, 512, 7, 7] --
│ └─Sequential: 2-3 [1, 1000] --
│ │ └─Linear: 3-32 [1, 4096] 102,764,544
│ │ └─ReLU: 3-33 [1, 4096] --
│ │ └─Dropout: 3-34 [1, 4096] --
│ │ └─Linear: 3-35 [1, 4096] 16,781,312
│ │ └─ReLU: 3-36 [1, 4096] --
│ │ └─Dropout: 3-37 [1, 4096] --
│ │ └─Linear: 3-38 [1, 1000] 4,097,000
==========================================================================================
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0
Total mult-adds (G): 15.48
==========================================================================================
Input size (MB): 0.60
Forward/backward pass size (MB): 108.45
Params size (MB): 553.43
Estimated Total Size (MB): 662.49
==========================================================================================

Dropout: regularization makes slight modifications to the learning algorithm so the model generalizes better. During training, a dropout layer discards some proportion (around 30%) of the neurons in the previous layer, and training happens without them. This helps get the optimization process out of local minima, and distributes decisive power among different neural paths, improving the overall stability of the network.
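A small sketch of this behavior, assuming a dropout probability of 0.3:

drop = nn.Dropout(p=0.3)
x = torch.ones(1, 8)
drop.train()
print(drop(x))   # about 30% of values are zeroed; the rest are scaled by 1/(1-0.3)
drop.eval()
print(drop(x))   # at evaluation time dropout is the identity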

Extracting VGG features

If we want to use VGG-16 to extract features from our images, we need the model without the final classification layers. In fact, this "feature extractor" can be obtained using the vgg.features sub-module:

res = vgg.features(sample_image).cpu()
plt.figure(figsize=(15,3))
plt.imshow(res.detach().view(-1,512))
plt.show()


Those features can be used to classify images. Let's manually take some portion of the images (800 in our case) and pre-compute their feature vectors. We store the result in one big tensor called feature_tensor, and the labels in label_tensor:

from torch.utils.data import DataLoader, TensorDataset, random_split
from torchvision.models import vgg16


def manual_feature_extraction(training_dataloader):
    # each VGG-16 feature map has dimension 512x7x7
    num = batch_size * 100
    vgg = vgg16(pretrained=True).to(device)
    feature_tensor = torch.zeros(num, 512 * 7 * 7).to(device)
    label_tensor = torch.zeros(num).to(device)
    with torch.no_grad():
        for batch, (x, y) in enumerate(training_dataloader):
            i = batch * batch_size  # index of the first sample in this minibatch
            if i + batch_size > num:
                break
            f = vgg.features(x.to(device))
            feature_tensor[i:i + batch_size] = f.view(batch_size, -1)
            label_tensor[i:i + batch_size] = y
    vgg_dataset = TensorDataset(feature_tensor, label_tensor.to(torch.long))
    size, train_size = len(vgg_dataset), int(len(vgg_dataset) / 7 * 6)

    training_data, test_data = random_split(vgg_dataset, [train_size, size - train_size])
    training_dataloader = DataLoader(training_data, batch_size=batch_size, shuffle=True)
    test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=True)
    return training_dataloader, test_dataloader


if __name__ == "__main__":
    load_data()
    net = nn.Sequential(nn.Linear(512 * 7 * 7, 2), nn.Softmax(dim=1)).to(device)
    train_feature_loader, test_feature_loader = manual_feature_extraction(training_dataloader)
    print("features extraction done")
    hist = train(net, train_feature_loader, test_feature_loader)
    displayutils.plot_acc_loss(hist)


features extraction done
epoch: 1
------------------------------------------
train loss: 0.6915339231491089 [0/2742]
train_loss: 0.692696 train_acc: 0.920496 2742
val_loss: 0.675718 val_acc: 0.989083 458

epoch: 2
------------------------------------------
train loss: 0.6835835576057434 [0/2742]
train_loss: 0.676036 train_acc: 0.998906 2742
val_loss: 0.664526 val_acc: 0.993450 458

epoch: 3
------------------------------------------
train loss: 0.6631014347076416 [0/2742]
train_loss: 0.664522 train_acc: 0.999635 2742
val_loss: 0.652398 val_acc: 0.993450 458

epoch: 4
------------------------------------------
train loss: 0.6546106338500977 [0/2742]
train_loss: 0.654346 train_acc: 1.000000 2742
val_loss: 0.643583 val_acc: 0.993450 458


Transfer learning using one VGG network

The VGG network contains:

  • a feature extractor (features), comprised of a number of convolutional and pooling layers
  • an average pooling layer (avgpool)
  • a final classifier, consisting of several dense layers, which turns 25088 input features into 1000 classes (the number of classes in ImageNet)

To train the end-to-end model that will classify our dataset, we need to:

  • replace the final classifier with one that produces the required number of classes. In our case, we can use one Linear layer with 25088 inputs and 2 output neurons.
  • freeze the weights of the convolutional feature extractor, so that they are not trained. It is recommended to do this freezing initially, because otherwise the untrained classifier layer can destroy the original pre-trained weights of the convolutional extractor. Freezing weights can be accomplished by setting the requires_grad property of all parameters to False.
class TransferVgg16(nn.Module):
    def __init__(self):
        super(TransferVgg16, self).__init__()
        self.vgg16 = vgg16(pretrained=True).to(device)
        # replace the 1000-class ImageNet classifier with a 2-class one
        self.vgg16.classifier = nn.Sequential(
            nn.Linear(in_features=512 * 7 * 7, out_features=2),
            nn.Softmax(dim=1)
        )
        # freeze the convolutional feature extractor
        for x in self.vgg16.features.parameters():
            x.requires_grad = False

    def forward(self, x):
        return self.vgg16(x)

==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
TransferVgg16 -- --
├─VGG: 1-1 [1, 2] --
│ └─Sequential: 2-1 [1, 512, 7, 7] --
│ │ └─Conv2d: 3-1 [1, 64, 244, 244] (1,792)
│ │ └─ReLU: 3-2 [1, 64, 244, 244] --
│ │ └─Conv2d: 3-3 [1, 64, 244, 244] (36,928)
│ │ └─ReLU: 3-4 [1, 64, 244, 244] --
│ │ └─MaxPool2d: 3-5 [1, 64, 122, 122] --
│ │ └─Conv2d: 3-6 [1, 128, 122, 122] (73,856)
│ │ └─ReLU: 3-7 [1, 128, 122, 122] --
│ │ └─Conv2d: 3-8 [1, 128, 122, 122] (147,584)
│ │ └─ReLU: 3-9 [1, 128, 122, 122] --
│ │ └─MaxPool2d: 3-10 [1, 128, 61, 61] --
│ │ └─Conv2d: 3-11 [1, 256, 61, 61] (295,168)
│ │ └─ReLU: 3-12 [1, 256, 61, 61] --
│ │ └─Conv2d: 3-13 [1, 256, 61, 61] (590,080)
│ │ └─ReLU: 3-14 [1, 256, 61, 61] --
│ │ └─Conv2d: 3-15 [1, 256, 61, 61] (590,080)
│ │ └─ReLU: 3-16 [1, 256, 61, 61] --
│ │ └─MaxPool2d: 3-17 [1, 256, 30, 30] --
│ │ └─Conv2d: 3-18 [1, 512, 30, 30] (1,180,160)
│ │ └─ReLU: 3-19 [1, 512, 30, 30] --
│ │ └─Conv2d: 3-20 [1, 512, 30, 30] (2,359,808)
│ │ └─ReLU: 3-21 [1, 512, 30, 30] --
│ │ └─Conv2d: 3-22 [1, 512, 30, 30] (2,359,808)
│ │ └─ReLU: 3-23 [1, 512, 30, 30] --
│ │ └─MaxPool2d: 3-24 [1, 512, 15, 15] --
│ │ └─Conv2d: 3-25 [1, 512, 15, 15] (2,359,808)
│ │ └─ReLU: 3-26 [1, 512, 15, 15] --
│ │ └─Conv2d: 3-27 [1, 512, 15, 15] (2,359,808)
│ │ └─ReLU: 3-28 [1, 512, 15, 15] --
│ │ └─Conv2d: 3-29 [1, 512, 15, 15] (2,359,808)
│ │ └─ReLU: 3-30 [1, 512, 15, 15] --
│ │ └─MaxPool2d: 3-31 [1, 512, 7, 7] --
│ └─AdaptiveAvgPool2d: 2-2 [1, 512, 7, 7] --
│ └─Sequential: 2-3 [1, 2] --
│ │ └─Linear: 3-32 [1, 2] 50,178
│ │ └─Softmax: 3-33 [1, 2] -- --
==========================================================================================
Total params: 14,764,866
Trainable params: 50,178
Non-trainable params: 14,714,688
Total mult-adds (G): 17.99
==========================================================================================

This model contains around 15 million parameters in total, but only 50K of them are trainable: the weights of the classification layer.
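A quick sketch of how to verify those numbers, assuming net is the TransferVgg16 instance:

total = sum(p.numel() for p in net.parameters())
trainable = sum(p.numel() for p in net.parameters() if p.requires_grad)
print(f"total: {total}, trainable: {trainable}")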

path = 'data/cats_dogs.pth'
net = TransferVgg16()
summary(net, input_size=(1, 3, 244, 244))
sub_dataset(2000)
hist = train(net, training_dataloader, test_dataloader)
plot_acc_loss(hist)
torch.save(net, path)

epoch: 10
------------------------------------------
train loss: 0.315983 [0/1500]
train loss: 0.319530 [320/1500]
train loss: 0.315883 [640/1500]
train loss: 0.315727 [960/1500]
train loss: 0.315908 [1280/1500]
train_loss: 0.323956 train_acc: 0.999333 1500
val_loss: 0.322157 val_acc: 0.994000 500


Fine-tuning transfer learning

In the previous section, we trained the final classifier layer to classify images in our own dataset. However, we did not re-train the feature extractor, so our model relied on the features the model learned on ImageNet data. If your objects visually differ from ordinary ImageNet images, this combination of features might not work best. Thus it makes sense to start training the convolutional layers as well.

To do this, we can unfreeze the convolutional filter parameters that we previously froze.
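A minimal sketch, assuming the TransferVgg16 model from above (the small learning rate is a common precaution, not from the original code):

for p in net.vgg16.features.parameters():
    p.requires_grad = True
# fine-tune with a much smaller learning rate to avoid destroying the pre-trained weights
optimizer = torch.optim.Adam(net.parameters(), lr=1e-5)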

Other computer vision models

VGG-16 is one of the simplest computer vision architectures. The torchvision package provides many more pre-trained networks. The most frequently used among those are the ResNet architectures, developed by Microsoft, and Inception, developed by Google.
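For example, a ResNet model can be loaded the same way as VGG-16 above (a sketch using the same pretrained flag):

resnet = torchvision.models.resnet18(pretrained=True)
summary(resnet, input_size=(1, 3, 224, 224))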