Datasets & DataLoaders & Transforms

Load data with PyTorch Datasets and DataLoaders

PyTorch provides two data primitives: torch.utils.data.DataLoader and torch.utils.data.Dataset that allow you to use pre-loaded datasets as well as your own data. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.

Loading a dataset

Load the Fashion-MNIST dataset from TorchVision. Fashion-MNIST is a dataset of Zalando’s article images consisting of 60,000 training examples and 10,000 test examples. Each example comprises a 28×28 grayscale image and an associated label from one of 10 classes. The images show individual articles of clothing at low resolution (28 by 28 pixels), as seen here:

[Figure: sample Fashion-MNIST images of individual clothing articles]

We load the FashionMNIST Dataset with the following parameters:

  • root (string) – Root directory of the dataset where FashionMNIST/processed/training.pt and FashionMNIST/processed/test.pt exist.

  • train (bool, optional) – If True, creates dataset from training.pt, otherwise from test.pt.

  • download (bool, optional) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version. E.g., transforms.RandomCrop

  • target_transform (callable, optional) – A function/transform that takes in the target and transforms it.

import torch
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor, Lambda
import matplotlib.pyplot as plt

training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    # ToTensor converts a PIL image or NumPy ndarray into a FloatTensor and
    # scales the image's pixel intensity values into the range [0., 1.]
    transform=ToTensor(),
    # define a function to turn the integer label into a one-hot encoded tensor.
    # It first creates a zero tensor of size 10 (the number of labels in our dataset) and
    # calls scatter_, which assigns value=1 at the index given by the label y.
    target_transform=Lambda(lambda y: torch.zeros(10, dtype=torch.float)
                            .scatter_(dim=0, index=torch.tensor(y), value=1))
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor(),
    # the same one-hot encoding of the integer label as above
    target_transform=Lambda(lambda y: torch.zeros(10, dtype=torch.float)
                            .scatter_(dim=0, index=torch.tensor(y), value=1))
)

Iterating and Visualizing the Dataset

We can index Datasets manually like a list: training_data[index]. We use matplotlib to visualize some samples in our training data.

labels_map = {
    0: "T-Shirt",
    1: "Trouser",
    2: "Pullover",
    3: "Dress",
    4: "Coat",
    5: "Sandal",
    6: "Shirt",
    7: "Sneaker",
    8: "Bag",
    9: "Ankle Boot",
}
figure = plt.figure(figsize=(8, 8))
cols, rows = 4, 4
for i in range(1, cols * rows + 1):
    sample_idx = torch.randint(len(training_data), size=(1,)).item()
    img, label = training_data[sample_idx]
    figure.add_subplot(rows, cols, i)
    # label is a one-hot tensor (because of target_transform); recover the class index
    plt.title(labels_map[label.argmax().item()])
    plt.axis("off")
    plt.imshow(img.squeeze(), cmap="gray")

plt.show()


Creating a Custom Dataset

A custom Dataset class must implement three functions: __init__, __len__, and __getitem__. Take a look at this implementation; the FashionMNIST images are stored in a directory img_dir, and their labels are stored separately in a CSV file annotations_file.

import os
import pandas as pd
from torch.utils.data import Dataset
from torchvision import io as tvio

class CustomImageDataset(Dataset):
    # The __init__ function is run once when instantiating the Dataset object.
    # We initialize the directory containing the images, the annotations file,
    # and both transforms (covered in more detail in the next section).
    def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):
        self.img_labels = pd.read_csv(annotations_file)
        self.img_dir = img_dir
        self.transform = transform
        self.target_transform = target_transform

    # The __len__ function returns the number of samples in our dataset.
    def __len__(self):
        return len(self.img_labels)

    # The __getitem__ function loads and returns a sample from the dataset at the given index idx.
    # Based on the index, it identifies the image's location on disk, reads it as a tensor using read_image,
    # retrieves the corresponding label from the csv data in self.img_labels,
    # calls the transform functions on them (if applicable),
    # and returns the tensor image and corresponding label in a Python dict.
    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        image = tvio.read_image(img_path)
        label = self.img_labels.iloc[idx, 1]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        sample = {"image": image, "label": label}
        return sample
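Usage might then look like the sketch below. The paths and CSV layout here are hypothetical, chosen only to match the indexing in __getitem__ (first column: image file name, second column: integer label):

# hypothetical paths and files, for illustration only
dataset = CustomImageDataset(
    annotations_file="data/labels.csv",  # e.g. rows like: img_0001.png,3
    img_dir="data/images",
    transform=None,
    target_transform=None,
)
sample = dataset[0]  # a dict with "image" and "label" keys
print(sample["image"].shape, sample["label"])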

Preparing your data for training with DataLoaders

The Dataset retrieves our dataset's features and labels one sample at a time. While training a model, we typically want to pass samples in "minibatches", reshuffle the data at every epoch to reduce model overfitting, and use Python's multiprocessing to speed up data retrieval.

DataLoader is an iterable that abstracts this complexity for us in an easy API.

from torch.utils.data import DataLoader

train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)
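The multiprocessing mentioned above is switched on with the DataLoader's num_workers argument; a minimal sketch (the worker count is an arbitrary example value, tune it to your machine):

# spawn background worker processes that load and collate batches in parallel
train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True, num_workers=4)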

Iterate through the DataLoader

We have loaded the dataset into the DataLoader and can iterate through it as needed. Each iteration below returns a batch of train_features and train_labels (each containing batch_size=64 features and labels, respectively).

Because shuffle=True, the data is shuffled after we iterate over all batches (for finer-grained control over the data-loading order, take a look at Samplers).
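As one illustration of that finer-grained control (a sketch of my own, not part of the example code below), a sampler can take the place of shuffle=True:

from torch.utils.data import SubsetRandomSampler

# draw batches only from the first 1,000 samples, in random order
# (the subset size is an arbitrary example value)
sampler = SubsetRandomSampler(list(range(1000)))
subset_loader = DataLoader(training_data, batch_size=64, sampler=sampler)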

train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)

labels_map = {
    0: "T-Shirt",
    1: "Trouser",
    2: "Pullover",
    3: "Dress",
    4: "Coat",
    5: "Sandal",
    6: "Shirt",
    7: "Sneaker",
    8: "Bag",
    9: "Ankle Boot",
}

# iterate through the DataLoader; each iteration yields one batch of features and labels
for train_features, train_labels in train_dataloader:
    print(f"feature size: {train_features.size()}")
    print(f"label size: {train_labels.size()}")

# visualize the last batch returned by the loop above
figure = plt.figure(figsize=(28, 28))
for i in range(train_features.size(0)):
    img = train_features[i].squeeze()
    # labels are one-hot tensors (because of target_transform); recover the class index
    label = train_labels[i].argmax().item()
    figure.add_subplot(8, 8, i + 1)
    plt.title(labels_map[label])
    plt.axis("off")
    plt.imshow(img, cmap="gray")
    print(f"label: {label}")
plt.show()

Transforms

Data does not always come in its final processed form that is required for training machine learning algorithms. We use transforms to perform some manipulation of the data and make it suitable for training.

All TorchVision datasets have two parameters (transform to modify the features and target_transform to modify the labels) that accept callables containing the transformation logic.

The FashionMNIST features are in PIL Image format, and the labels are integers. For training, we need the features as normalized tensors, and the labels as one-hot encoded tensors. To make these transformations, we use ToTensor and Lambda.
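Before looking at the Lambda transform below, ToTensor itself can be tried in isolation; a small sketch (the random image here is only a stand-in for a real FashionMNIST sample):

import numpy as np
from PIL import Image
from torchvision.transforms import ToTensor

# a dummy 28x28 grayscale PIL image, used only for illustration
pil_img = Image.fromarray(np.random.randint(0, 256, (28, 28), dtype=np.uint8))
tensor_img = ToTensor()(pil_img)
print(tensor_img.shape, tensor_img.dtype)  # torch.Size([1, 28, 28]) torch.float32
print(float(tensor_img.min()), float(tensor_img.max()))  # values scaled into [0., 1.]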

# define a function to turn the integer label into a one-hot encoded tensor.
# It first creates a zero tensor of size 10 (the number of labels in our dataset) and
# calls scatter_, which assigns value=1 at the index given by the label y.
target_transform=Lambda(lambda y: torch.zeros(10, dtype=torch.float)
                        .scatter_(dim=0, index=torch.tensor(y), value=1))
For example:

>>> one_hot = lambda y: torch.zeros(10, dtype=torch.float).scatter_(dim=0, index=torch.tensor(y), value=1)
>>> one_hot(3)
tensor([0., 0., 0., 1., 0., 0., 0., 0., 0., 0.])
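For reference, the same encoding can also be produced with torch.nn.functional.one_hot (an equivalent alternative, not the approach used in the tutorial code above):

>>> import torch.nn.functional as F
>>> F.one_hot(torch.tensor(3), num_classes=10).float()
tensor([0., 0., 0., 1., 0., 0., 0., 0., 0., 0.])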