Building the model layers

Build a neural network

Take a sample minibatch of 3 images of size 28x28:

import torch
from torch import nn

input_image = torch.rand(3, 28, 28)
print(input_image.size())

nn.Flatten

We initialize the nn.Flatten layer to convert each 2D 28x28 image into a contiguous array of 784 pixel values (the minibatch dimension, at dim=0, is maintained).

flatten = nn.Flatten()
flat_image = flatten(input_image)

nn.Linear

The linear layer applies a linear transformation on the input using its stored weights and biases.

layer = nn.Linear(in_features=28 * 28, out_features=20)
hidden = layer(flat_image)

nn.ReLU

Non-linear activations are what create the complex mappings between the model's inputs and outputs. They are applied after linear transformations to introduce nonlinearity, helping neural networks learn a wide variety of phenomena.

hidden = nn.ReLU()(hidden)

nn.Sequential

nn.Sequential is an ordered container of modules. The data is passed through all the modules in the same order as defined. You can use sequential containers to put together a quick network like seq_modules.

seq_modules = nn.Sequential(
    nn.Flatten(),
    nn.Linear(in_features=28 * 28, out_features=20),
    nn.ReLU(),
    nn.Linear(in_features=20, out_features=10)
)
input_image = torch.rand(3,28,28)
logits = seq_modules(input_image)

nn.Softmax

The last linear layer of the network returns logits - raw values in \((-\infty, \infty)\) - which are passed to the nn.Softmax module. The logits are scaled to values in [0, 1] representing the model's predicted probabilities for each class. The dim parameter indicates the dimension along which the values must sum to 1.

pred_probab = nn.Softmax(dim=1)(logits)
pred_y = pred_probab.argmax(dim=1)

Model parameters

Many layers inside a neural network are parameterized, i.e. have associated weights and biases that are optimized during training. Subclassing nn.Module automatically tracks all fields defined inside your model object, and makes all parameters accessible using your model's parameters() or named_parameters() methods.

print(seq_modules)
for name, param in seq_modules.named_parameters():
    print(f"Layer: {name} | Size: {param.size()} | Value: {param.data} \n")

Sequential(
  (0): Flatten(start_dim=1, end_dim=-1)
  (1): Linear(in_features=784, out_features=20, bias=True)
  (2): ReLU()
  (3): Linear(in_features=20, out_features=10, bias=True)
)

Automatic differentiation

Back propagation: parameters (model weights) are adjusted according to the gradient of the loss function with respect to the given parameter.

PyTorch has a built-in differentiation engine called torch.autograd. It supports automatic computation of gradients for any computational graph.

Consider the simplest one-layer neural network, with input X, parameters W and b, and some loss function.

In this network, W and b are parameters which we need to optimize.

X = torch.rand(5)
Y = torch.rand(3)
W = torch.rand(5, 3, requires_grad=True)
b = torch.rand(3, requires_grad=True)
Z = torch.matmul(X, W) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(Z, Y)

#######################################################################################
>>> X
tensor([0.8609, 0.8016, 0.8709, 0.1333, 0.6258])
>>> W
tensor([[0.0608, 0.7731, 0.2732],
[0.9440, 0.6451, 0.7805],
[0.0341, 0.4740, 0.4896],
[0.7219, 0.7725, 0.3819],
[0.7504, 0.3699, 0.0386]], requires_grad=True)
>>> b
tensor([0.1596, 0.4110, 0.6390], requires_grad=True)
>>> Y
tensor([0.8099, 0.3342, 0.7097])
>>> Z
tensor([1.5642, 2.3410, 2.0013], grad_fn=<AddBackward0>)
>>> loss
tensor(0.9485, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)

Thus, we need to be able to compute the gradients of the loss function with respect to those variables. To do that, we set the requires_grad property of those tensors.

Note: You can set the value of requires_grad when creating a tensor, or later by using the x.requires_grad_(True) method.
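
For example, a minimal sketch of both approaches (the tensors here are made up for illustration):

import torch

# Set requires_grad when the tensor is created
a = torch.rand(3, requires_grad=True)

# Or enable it later, in place, with requires_grad_()
b = torch.rand(3)
b.requires_grad_(True)

print(a.requires_grad, b.requires_grad)  # True True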

Computing gradients

To optimize the parameters, we need to compute the derivatives of our loss function with respect to them, namely \(\frac{\partial loss}{\partial W}\) and \(\frac{\partial loss}{\partial b}\), under some fixed values of X and Y. To compute those derivatives, we call loss.backward() and then retrieve the values from W.grad and b.grad:

loss.backward()

>>> W.grad
tensor([[0.0049, 0.1659, 0.0491],
[0.0045, 0.1544, 0.0458],
[0.0049, 0.1678, 0.0497],
[0.0008, 0.0257, 0.0076],
[0.0035, 0.1206, 0.0357]])
>>> b.grad
tensor([0.0057, 0.1927, 0.0571])

Note: we can only perform gradient calculations using backward once on a given graph, for performance reasons. If we need to do several backward calls on the same graph, we need to pass retain_graph=True to the backward call.
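
A minimal sketch of the retain_graph=True case from the note (the tensors here are made up for illustration):

import torch

x = torch.rand(3, requires_grad=True)
y = (x * 2).sum()

y.backward(retain_graph=True)  # keep the graph alive for a second backward call
y.backward()                   # second call on the same graph; gradients accumulate
print(x.grad)                  # tensor([4., 4., 4.]) -- 2 + 2 from the two calls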

Disabling gradient tracking

By default, all tensors with requires_grad=True track their computational history and support gradient computation. We can stop tracking computations by surrounding our computation code with a torch.no_grad() block:

Z = torch.matmul(X, W) + b
with torch.no_grad():
    Z = torch.matmul(X, W) + b

or use the detach() method on the tensor:

Z = torch.matmul(X, W) + b
Z.detach_()
# or Z = Z.detach()

There are reasons you might want to disable gradient tracking:

  • To mark some parameters in your neural network as frozen parameters. This is a very common scenario for fine tuning a pre-trained network (see the sketch after this list).
  • To speed up computations when you are only doing a forward pass, because computations on tensors that do not track gradients are more efficient.
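
A minimal sketch of the frozen-parameter case (the small Sequential below is a made-up stand-in, not the tutorial's model):

from torch import nn

net = nn.Sequential(
    nn.Linear(28 * 28, 20),
    nn.ReLU(),
    nn.Linear(20, 10),
)

# Freeze the first Linear layer; no gradients will be computed or stored for it
for param in net[0].parameters():
    param.requires_grad_(False)

trainable = [name for name, p in net.named_parameters() if p.requires_grad]
print(trainable)  # ['2.weight', '2.bias']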

TIPS:

Conceptually, autograd keeps a record of data (tensors) and all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects. In this DAG, leaves are the input tensors, roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:

  • run the requested operation to compute a resulting tensor
  • maintain the operation’s gradient function in the DAG.

The backward pass kicks off when .backward() is called on the DAG root. autograd then:

  • computes the gradients from each .grad_fn (a reference to the backward propagation function is stored in the grad_fn property of a tensor; see the sketch after this list),
  • accumulates them in the respective tensor’s .grad attribute,
  • using the chain rule, propagates all the way to the leaf tensors.
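
Continuing the Z and loss example above, you can inspect these gradient functions directly (the exact class names printed may vary between PyTorch versions):

print(f"Gradient function for Z = {Z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")
# e.g. <AddBackward0 object at 0x...>
#      <BinaryCrossEntropyWithLogitsBackward0 object at 0x...>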

DAGs are dynamic in PyTorch

The graph is recreated from scratch: after each .backward() call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.
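
A small sketch of what this makes possible (DynamicNet is a made-up module, not part of this tutorial's model):

import torch
from torch import nn

class DynamicNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        # Ordinary Python control flow: the graph autograd records
        # can differ from one forward pass to the next.
        h = self.linear(x)
        if h.sum() > 0:
            h = self.linear(h)
        return h.sum()

net = DynamicNet()
net(torch.rand(2, 4)).backward()  # a fresh graph is built and consumed on each call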

Optional reading: Tensor gradients and Jacobian products

In many cases, we have a scalar loss function and we need to compute the gradient with respect to some parameters. However, there are cases when the output is an arbitrary tensor. In this case, PyTorch allows you to compute a so-called Jacobian product, and not the actual gradient.

For a vector function \(\vec{y}=f(\vec{x})\), where \(\vec{x}=\langle x_1,\dots,x_n\rangle\) and \(\vec{y}=\langle y_1,\dots,y_m\rangle\), the gradient of \(\vec{y}\) with respect to \(\vec{x}\) is given by the Jacobian matrix:

\[ J=\left(\begin{array}{ccc} \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\ \vdots & \ddots & \vdots\\ \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}} \end{array}\right) \]

Instead of computing the Jacobian matrix itself, PyTorch allows you to compute the Jacobian product \(v^T\cdot J\) for a given input vector \(v=(v_1 \dots v_m)\). This is achieved by calling backward with \(v\) as an argument. The size of \(v\) should be the same as the size of the original tensor with respect to which we want to compute the product:

>>> x1 = torch.tensor(1, requires_grad=True, dtype=torch.float)
>>> x2 = torch.tensor(2, requires_grad=True, dtype=torch.float)
>>> x3 = torch.tensor(3, requires_grad=True, dtype=torch.float)
>>> y = torch.randn(3)        # placeholder vector to hold the outputs of the vector function
>>> y[0] = x1**2 + 2*x2 + x3  # define each component of the vector function
>>> y[1] = x1 + x2**3 + x3**2
>>> y[2] = 2*x1 + x2**2 + x3**3
>>> y.backward(torch.ones(3))
>>> x1.grad
tensor(5.)
>>> x2.grad
tensor(18.)
>>> x3.grad
tensor(34.)

Jacobian matrix: \[ J = \begin{pmatrix} 2x_1 & 2 & 1 \\ 1 & 3x_2^2 & 2x_3 \\ 2 & 2x_2 & 3x_3^2 \end{pmatrix} \qquad \left\{\begin{matrix} y_1 = & x_1^2+2x_2+x_3 \\ y_2 = & x_1+x_2^3+x_3^2 \\ y_3 = & 2x_1+x_2^2+x_3^3 \end{matrix}\right. \]

vector v: \[ v = (1,1,1) \] \(v^{T}J\): \[ v^{T} J = (2x_1+1+2,\ \ \ 2+3x_2^2+2x_2,\ \ \ 1+2x_3+3x_3^2)=(5,18,34) \] The above is essentially the directional derivative.

>>> inp = torch.eye(3,3, requires_grad=True)
>>> out = (inp+1).pow(2)
>>> out.backward(torch.ones_like(inp))
>>> inp.grad
tensor([[4., 2., 2.],
[2., 4., 2.],
[2., 2., 4.]])

>>> inp.grad.zero_()
tensor([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]])

Optimizing the model parameters

Training a model is an iterative process. In each iteration (called an epoch) the model:

  • makes a guess about the output,
  • calculates the error in its guess (loss) and collects the derivatives of the error with respect to its parameters (as we saw in the previous section),
  • optimizes these parameters using gradient descent.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor, Lambda


class DataPreparation:
    def __init__(self):
        pass

    @staticmethod
    def data_load(if_download=False):
        training_data = datasets.FashionMNIST(
            root="data",
            train=True,
            download=if_download,
            transform=ToTensor(),
            # One-hot encode the labels (nn.CrossEntropyLoss does not require this; it also accepts class indices)
            target_transform=Lambda(lambda y: torch.zeros(10, dtype=torch.float)
                                    .scatter_(dim=0, index=torch.tensor(y), value=1))
        )
        test_data = datasets.FashionMNIST(
            root="data",
            train=False,
            download=if_download,
            transform=ToTensor(),
            # One-hot encode the labels
            target_transform=Lambda(lambda y: torch.zeros(10, dtype=torch.float)
                                    .scatter_(dim=0, index=torch.tensor(y), value=1))
        )
        return training_data, test_data


class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
            nn.ReLU()
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

    def information(self):
        print(f"model structure: {self}")
        for name, param in self.named_parameters():
            print(f"Layer: {name} | Size: {param.size()} | Value: {param.data} \n")

Setting hyperparameters

We define the following hyperparameters for training:

  • Number of Epochs - the number of times to iterate over the dataset
  • Batch Size - the number of data samples propagated through the network before the parameters are updated
  • Learning Rate - how much to update the model's parameters at each batch/epoch. Smaller values yield slow learning speed, while large values may result in unpredictable behavior during training.
learning_rate = 1e-3
batch_size = 64
epochs = 5

Add an optimization loop

Each epoch consists of two main parts:

  • The Train Loop - iterate over the training dataset and try to converge to optimal parameters.
  • The Validation/Test Loop - iterate over the test dataset to check if model performance is improving.

Add a loss function

The loss function measures the degree of dissimilarity between the obtained result and the target value, and it is the loss function that we want to minimize during training. To calculate the loss we make a prediction using the inputs of our given data sample and compare it against the true data label value.

Common loss functions include nn.MSELoss (Mean Square Error) for regression tasks and nn.NLLLoss (Negative Log Likelihood) for classification; nn.CrossEntropyLoss combines nn.LogSoftmax and nn.NLLLoss.

We pass our model's output logits to nn.CrossEntropyLoss, which will normalize the logits and compute the prediction error.

Note: The input is expected to contain raw, unnormalized scores for each class.

# Initialize the loss function
loss_fn = nn.CrossEntropyLoss()
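
To make the note concrete, a small sketch with made-up shapes: the raw logits (shape [N, C]) are passed to the loss together with class indices, and the normalization happens inside nn.CrossEntropyLoss:

import torch
from torch import nn

loss_fn = nn.CrossEntropyLoss()

logits = torch.randn(3, 10)        # raw, unnormalized scores: 3 samples, 10 classes
targets = torch.tensor([1, 0, 4])  # ground-truth class indices

loss = loss_fn(logits, targets)    # no explicit softmax before the loss
print(loss.item())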

Optimization pass

Optimization algorithms define how this process is performed (in this example we use stochastic gradient descent). All optimization logic is encapsulated in the optimizer object. Here we use the SGD optimizer; additionally, there are many different optimizers available in PyTorch, such as Adam and RMSProp, that work better for different kinds of models and data.

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

Inside the training loop, optimization happens in three steps (a compact sketch follows this list):

  • Call optimizer.zero_grad() to reset the gradients of the model parameters. Gradients by default add up; to prevent double-counting, we explicitly zero them at each iteration.
  • Back-propagate the prediction loss with a call to loss.backward(). PyTorch deposits the gradients of the loss w.r.t. each parameter.
  • Once we have our gradients, we call optimizer.step() to adjust the parameters by the gradients collected in the backward pass.
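
A compact sketch of those three steps inside a training loop (model, loss_fn, optimizer and the dataloader are assumed to be set up as in the full implementation below):

for X, y in training_dataloader:
    pred = model(X)
    loss = loss_fn(pred, y)

    optimizer.zero_grad()  # 1. reset the gradients of the model parameters
    loss.backward()        # 2. back-propagate; gradients are deposited in each parameter's .grad
    optimizer.step()       # 3. adjust the parameters by the collected gradients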

Full implementation

# Get hardware device for training
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using {device} device")

model = Network().to(device)
print(model.parameters())
# model.information()

# Loss function
loss_fn = nn.CrossEntropyLoss()

# Optimization algorithm: stochastic gradient descent
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# DataLoader
training_data, test_data = DataPreparation().data_load()
training_dataloader = DataLoader(training_data, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=True)


def train_loop(dataloader):
    size = len(dataloader.dataset)
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction and loss
        # (y is one-hot encoded; nn.CrossEntropyLoss also accepts class probabilities in PyTorch 1.10+)
        pred_y = model(X)
        loss = loss_fn(pred_y, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()

        # Adjust the parameters by the gradients collected in the backward pass
        optimizer.step()

        if batch % 100 == 0:
            loss = loss.item()
            current = batch * len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")


def test_loop(dataloader):
    size = len(dataloader.dataset)
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            logits = model(X)
            pred_y = logits.argmax(1)
            test_loss += loss_fn(logits, y).item()
            # y is one-hot encoded, so compare the prediction against its argmax
            correct += (pred_y == y.argmax(1)).type(torch.float).sum().item()
    test_loss /= size
    correct /= size
    print(f"Test Error: \n Accuracy: {(100 * correct):>0.1f}%, "
          f"Avg loss: {test_loss:>8f} \n")


for t in range(epochs):
    print(f"epoch: {t + 1}")
    print("----------------------------------")
    train_loop(training_dataloader)
    test_loop(test_dataloader)
print("done")


######################################################################
epoch: 10
----------------------------------
loss: 1.258381 [ 0/60000]
loss: 1.305664 [ 6400/60000]
loss: 1.226043 [12800/60000]
loss: 1.363006 [19200/60000]
loss: 1.471584 [25600/60000]
loss: 1.185646 [32000/60000]
loss: 1.068239 [38400/60000]
loss: 1.386577 [44800/60000]
loss: 1.405703 [51200/60000]
loss: 1.409026 [57600/60000]
Test Error:
Accuracy: 59.1%, Avg loss: 0.019455

done

Save and load the model

Saving and loading model weights

PyTorch models store the learned parameters in an internal state dictionary, called state_dict. These can be persisted via the torch.save method:

torch.save(model.state_dict(), 'data/model_weights.pth')

To load model weights, you need to create an instance of the same model first, and then load the parameters using the load_state_dict() method.

model.load_state_dict(torch.load('data/model_weights.pth'))
model.eval()

Note: Be sure to call the model.eval() method before inference to set the dropout and batch normalization layers to evaluation mode. Failing to do this will yield inconsistent inference results.
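
A minimal sketch of that inference pattern (assuming the FashionMNIST-sized model above, kept on the CPU, and a made-up input):

model.eval()                        # dropout / batch norm layers switch to evaluation behaviour
with torch.no_grad():               # gradient tracking is not needed for inference
    sample = torch.rand(1, 28, 28)  # dummy input with the model's expected shape
    pred = model(sample)
    print(pred.argmax(1))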

Saving and loading models with shapes

When loading model weights we had to instantiate the model class first, because the class defines the structure of the network. To save the structure of the class together with the model, we can pass model (and not model.state_dict()) to the saving function:

torch.save(model, 'data/model.pth')

model = torch.load('data/model.pth')
optimizers = torch.optim.SGD(model.parameters(), lr=learning_rate)

Note: This approach uses Python pickle module when serializing the model, thus it relies on the actual class definition to be available when loading the model.

Exporting the model to ONNX

PyTorch also has native ONNX export support. Given the dynamic nature of the PyTorch execution graph, however, the export process must traverse the execution graph to produce a persisted ONNX model. For this reason, a test variable of the appropriate size should be passed in to the export routine (in our case, we will create a dummy zero tensor of the correct size):

import torch.onnx as onnx

input_image = torch.zeros((1, 1, 28, 28))  # a one-image batch matching the model's 28x28 input
onnx.export(model, input_image, 'data/model.onnx')