YOLOv5 Code Walkthrough (Input, Backbone, Neck, Output)


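Input: data augmentation, anchor computation, and so on.
Backbone: extracts features. Common backbones include VGG, ResNet, DenseNet, MobileNet, EfficientNet, CSPDarknet53, and Swin Transformer (YOLOv5s uses CSPDarknet53). When the model is applied to a different scenario, it can be fine-tuned to suit that scenario better.
Neck: designed to make better use of the features extracted by the backbone, by reprocessing and combining the feature maps from its different stages. Common structures include FPN, PANet, NAS-FPN, BiFPN, ASFF, and SFAM (YOLOv5 uses the PAN structure). What they share is the repeated use of upsampling, downsampling, concatenation, element-wise addition, and dot products to build an aggregation strategy.
Head: the backbone is essentially a classification network and cannot localize objects on its own. The head predicts object locations and classes from the feature maps the backbone extracts.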

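1 Input

1.1 Data augmentation

The LoadImagesAndLabels class defines how the dataset is processed. It inherits from PyTorch's Dataset class and must implement the __init__, __getitem__, and __len__ methods. At each training step, the DataLoader iterator fetches a batch of training data through __getitem__. The key part of a custom dataset is __getitem__, which is where all the data augmentations are applied.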

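1.1.1 MixUp data augmentation

Paper (ICLR 2018): mixup: Beyond Empirical Risk Minimization

The core idea of MixUp is to pick two images at random from each batch and blend them in a certain ratio to produce a new image. Training uses only the blended images; the original images no longer take part in training.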

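Let sample 1 be (xi, yi) and sample 2 be (xj, yj), where x is the image and y its label, and let the mixed sample be (x', y'). The mixing formula is:

x' = λ·xi + (1 − λ)·xj
y' = λ·yi + (1 − λ)·yj

where λ ∈ [0,1] is a random number drawn from a Beta distribution whose two parameters both equal α.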
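
As a rough illustration of the mixing rule, here is a minimal NumPy sketch for a classification-style sample with one-hot labels. The function name mixup_pair and the value alpha=0.2 are assumptions made for this example; they are not taken from the YOLOv5 code, which concatenates detection labels instead of blending them.

```python
import numpy as np

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
    """Blend two samples and their one-hot labels with the same ratio."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)              # lambda ~ Beta(alpha, alpha), lies in [0, 1]
    x_mix = lam * x_i + (1.0 - lam) * x_j     # blended image
    y_mix = lam * y_i + (1.0 - lam) * y_j     # blended (soft) label
    return x_mix, y_mix, lam

# Toy data: an all-zero image of class 0 and an all-one image of class 1.
rng = np.random.default_rng(0)
x_i, x_j = np.zeros((2, 2)), np.ones((2, 2))
y_i, y_j = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix, lam = mixup_pair(x_i, y_i, x_j, y_j, rng=rng)
print(lam, y_mix)
```

Because both the image and the label use the same λ, the soft label always sums to 1 and tells the model how much of each source image is present.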

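According to the experiments in the original paper, after 200 epochs on ImageNet-2012 mixup improved several networks by 1.2 to 1.5 percentage points. It improved results by 1.0 to 1.4 percentage points on CIFAR-10 and by 1.9 to 4.5 percentage points on CIFAR-100.

The mixup implementation in YOLOv5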

def mixup(im, labels, im2, labels2):
    # Applies MixUp augmentation https://arxiv.org/pdf/1710.09412.pdf
    r = np.random.beta(32.0, 32.0)  # mixup ratio, alpha=beta=32.0
    im = (im * r + im2 * (1 - r)).astype(np.uint8)  # blend the two images
    labels = np.concatenate((labels, labels2), 0)  # labels are simply concatenated, which is simpler
    return im, labels

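1.1.2 Cutout data augmentation

Cutout paper: Improved Regularization of Convolutional Neural Networks with Cutout

CNNs are very powerful, but their large learning capacity sometimes leads to overfitting. To address this, the paper proposes a simple regularization method, cutout, which randomly masks square regions of the input image during training. It resembles dropout, with two main differences: (1) it drops data from the input image, and (2) it drops a whole contiguous region rather than individual neurons. This helps the CNN attend to different features, because removing a whole region keeps the removed information from being passed downstream through other paths. Dropout, by contrast, works poorly in convolutional layers because (1) convolutional layers have fewer parameters than fully connected layers, so regularization has less effect, and (2) neighboring image elements are strongly correlated. Cutout removes an entire region and is closer to data augmentation, so it works better with convolutional layers. It is easy to implement, and experiments show that it can be combined with other data augmentation methods to improve model performance. The authors found that the size of the cutout region matters more than its shape, so for simplicity they chose a square; they also found that allowing the cutout region to extend beyond the image boundary gives better results.

The cutout implementation in YOLOv5 (disabled by default)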

def cutout(im, labels, p=0.5):
    # Applies image cutout augmentation https://arxiv.org/abs/1708.04552
    if random.random() < p:
        h, w = im.shape[:2]
        scales = [0.5] * 1 + [0.25] * 2 + [0.125] * 4 + [0.0625] * 8 + [0.03125] * 16  # image size fraction
        for s in scales:
            mask_h = random.randint(1, int(h * s))  # create random masks
            mask_w = random.randint(1, int(w * s))

            # box
            xmin = max(0, random.randint(0, w) - mask_w // 2)
            ymin = max(0, random.randint(0, h) - mask_h // 2)
            xmax = min(w, xmin + mask_w)
            ymax = min(h, ymin + mask_h)

            # apply random color mask
            im[ymin:ymax, xmin:xmax] = [random.randint(64, 191) for _ in range(3)]

            # return unobscured labels
            if len(labels) and s > 0.03:
                box = np.array([xmin, ymin, xmax, ymax], dtype=np.float32)
                ioa = bbox_ioa(box, labels[:, 1:5])  # intersection over area
                labels = labels[ioa < 0.60]  # remove >60% obscured labels

    return labels

CutMix

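CutMix paper: CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features

mixup: the blended image is locally blurry and unnatural, which confuses the model, especially for localization.
cutout: the cut-out region is usually filled with zeros or random noise, so the information in that region is wasted during training.

CutMix improves on cutout by filling the cut-out region with the corresponding patch from another image. This keeps the advantage of cutout: the model learns an object's features from a partial view and pays more attention to its less discriminative parts. It is also more efficient than cutout, because the region filled from the other image lets the model learn the features of two objects at once.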
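
The idea can be sketched in a few lines of NumPy. This is only an illustrative version under stated assumptions: the function name cutmix_pair, the square-root rule for the box size, and the returned mixing ratio follow the general description of CutMix for classification and are not part of the YOLOv5 code shown in this article.

```python
import numpy as np

def cutmix_pair(img_a, img_b, lam, rng):
    """Paste a random box from img_b into a copy of img_a.

    lam is the intended share of img_a; the box covers roughly (1 - lam) of the area.
    """
    h, w = img_a.shape[:2]
    cut = np.sqrt(1.0 - lam)                      # side ratio so that box area is about (1 - lam)
    bh, bw = int(h * cut), int(w * cut)
    cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))  # random box center
    y1, y2 = max(cy - bh // 2, 0), min(cy + bh // 2, h)         # clip the box to the image
    x1, x2 = max(cx - bw // 2, 0), min(cx + bw // 2, w)
    out = img_a.copy()
    out[y1:y2, x1:x2] = img_b[y1:y2, x1:x2]       # fill the cut-out region from the other image
    lam_adj = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)  # actual share of img_a after clipping
    return out, lam_adj

rng = np.random.default_rng(0)
img_a, img_b = np.zeros((8, 8)), np.ones((8, 8))
out, lam_adj = cutmix_pair(img_a, img_b, lam=0.75, rng=rng)
print(out.mean(), lam_adj)
```

Unlike cutout, no pixels are wasted: every position in the output still shows real image content from one of the two sources.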

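1.1.3 Mosaic data augmentation

Mosaic is a new method introduced in YOLOv4. It builds on the CutMix augmentation proposed in 2019, but whereas CutMix stitches together only two images, Mosaic uses four, combining them with random scaling, random cropping, and random placement. Mosaic has the following advantages:

(1) Richer dataset: four images are chosen at random, randomly scaled, and randomly arranged into one image, which greatly enriches the detection dataset. Random scaling in particular adds many small objects and makes the network more robust.
(2) Lower GPU memory demand: each training image already contains the data of four images, so the mini-batch size does not need to be large to get good results.

Mosaic proceeds as follows:

Initialize the full background canvas with size (2 × image_size, 2 × image_size, 3).
Keep some margin at the edges and pick a center point at random.
Using the center point, place the four images at the top left, top right, bottom left, and bottom right. A tile may extend past the canvas here, so it may be cropped during stitching.
Compute the offset of the ground-truth boxes and recompute their positions in the large image.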
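
The last step, moving a tile's boxes onto the canvas, can be sketched as follows. The helper below is a simplified stand-in written for this example (the name xywhn_to_xyxy and the toy numbers are assumptions); it mirrors the role that xywhn2xyxy plays in the YOLOv5 listing, where padw and padh are the offsets of the tile's origin on the canvas.

```python
import numpy as np

def xywhn_to_xyxy(boxes, w, h, padw=0, padh=0):
    """Normalized (cx, cy, bw, bh) -> pixel (x1, y1, x2, y2), shifted by the tile offset."""
    boxes = np.asarray(boxes, dtype=np.float64)
    out = np.empty_like(boxes)
    out[:, 0] = w * (boxes[:, 0] - boxes[:, 2] / 2) + padw  # left edge on the canvas
    out[:, 1] = h * (boxes[:, 1] - boxes[:, 3] / 2) + padh  # top edge on the canvas
    out[:, 2] = w * (boxes[:, 0] + boxes[:, 2] / 2) + padw  # right edge on the canvas
    out[:, 3] = h * (boxes[:, 1] + boxes[:, 3] / 2) + padh  # bottom edge on the canvas
    return out

# A tile 200 px wide and 100 px high whose top-left corner lands at (300, 400) on the canvas.
box = np.array([[0.5, 0.5, 0.5, 0.5]])  # centered box covering half of each side
canvas_box = xywhn_to_xyxy(box, w=200, h=100, padw=300, padh=400)
print(canvas_box)
```

After all four tiles are processed this way, their boxes share one coordinate system and can be concatenated and clipped to the canvas.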

The 4-mosaic and 9-mosaic implementations in YOLOv5

    def load_mosaic(self, index):
        # YOLOv5 4-mosaic loader. Loads 1 image + 3 random images into a 4-image mosaic
        labels4, segments4 = [], []
        s = self.img_size
        yc, xc = (int(random.uniform(-x, 2 * s + x)) for x in self.mosaic_border)  # mosaic center x, y; range-limited so a margin of mosaic_border is kept on each side
        indices = [index] + random.choices(self.indices, k=3)  # pick three more images at random from the whole dataset
        random.shuffle(indices)
        for i, index in enumerate(indices):
            # Load image
            img, _, (h, w) = self.load_image(index)

            # place img in img4
            if i == 0:  # top left
                img4 = np.full((s * 2, s * 2, img.shape[2]), 114, dtype=np.uint8)  # base image with 4 tiles (np.full fills the 2s x 2s canvas with a constant value)
                x1a, y1a, x2a, y2a = max(xc - w, 0), max(yc - h, 0), xc, yc  # xmin, ymin, xmax, ymax (large image)
                x1b, y1b, x2b, y2b = w - (x2a - x1a), h - (y2a - y1a), w, h  # xmin, ymin, xmax, ymax (small image)
            elif i == 1:  # top right
                x1a, y1a, x2a, y2a = xc, max(yc - h, 0), min(xc + w, s * 2), yc
                x1b, y1b, x2b, y2b = 0, h - (y2a - y1a), min(w, x2a - x1a), h
            elif i == 2:  # bottom left
                x1a, y1a, x2a, y2a = max(xc - w, 0), yc, xc, min(s * 2, yc + h)
                x1b, y1b, x2b, y2b = w - (x2a - x1a), 0, w, min(y2a - y1a, h)
            elif i == 3:  # bottom right
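                x1a, y1a, x2a, y2a = xc, yc, min(xc + w, s * 2), min(s * 2, yc + h)  # position of each tile in the large image
                x1b, y1b, x2b, y2b = 0, 0, min(w, x2a - x1a), min(y2a - y1a, h)  # matching region inside each small image

            img4[y1a:y2a, x1a:x2a] = img[y1b:y2b, x1b:x2b]  # img4[ymin:ymax, xmin:xmax]; copy the tile into its place in the large image
            padw = x1a - x1b  # offset (padw, padh) of the tile's top-left corner relative to the canvas's top-left corner, used to place the label boxes after mosaic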
            padh = y1a - y1b

            # Labels
            labels, segments = self.labels[index].copy(), self.segments[index].copy()
            if labels.size:
                labels[:, 1:] = xywhn2xyxy(labels[:, 1:], w, h, padw, padh)  # normalized xywh to pixel xyxy format
                segments = [xyn2xy(x, w, h, padw, padh) for x in segments]
            labels4.append(labels)
            segments4.extend(segments)

        # Concat/clip labels
        labels4 = np.concatenate(labels4, 0)
        for x in (labels4[:, 1:], *segments4):
            np.clip(x, 0, 2 * s, out=x)  # clip when using random_perspective()
        # img4, labels4 = replicate(img4, labels4)  # replicate

        # Augment
        img4, labels4, segments4 = copy_paste(img4, labels4, segments4, p=self.hyp['copy_paste'])
        img4, labels4 = random_perspective(img4,
                                           labels4,
                                           segments4,
                                           degrees=self.hyp['degrees'],
                                           translate=self.hyp['translate'],
                                           scale=self.hyp['scale'],
                                           shear=self.hyp['shear'],
                                           perspective=self.hyp['perspective'],
                                           border=self.mosaic_border)  # border to remove

        return img4, labels4

    def load_mosaic9(self, index):
        # YOLOv5 9-mosaic loader. Loads 1 image + 8 random images into a 9-image mosaic
        labels9, segments9 = [], []
        s = self.img_size
        indices = [index] + random.choices(self.indices, k=8)  # 8 additional image indices
        random.shuffle(indices)
        hp, wp = -1, -1  # height, width previous
        for i, index in enumerate(indices):
            # Load image
            img, _, (h, w) = self.load_image(index)

            # place img in img9
            if i == 0:  # center
                img9 = np.full((s * 3, s * 3, img.shape[2]), 114, dtype=np.uint8)  # base image with 9 tiles
                h0, w0 = h, w
                c = s, s, s + w, s + h  # xmin, ymin, xmax, ymax (base) coordinates
            elif i == 1:  # top
                c = s, s - h, s + w, s
            elif i == 2:  # top right
                c = s + wp, s - h, s + wp + w, s
            elif i == 3:  # right
                c = s + w0, s, s + w0 + w, s + h
            elif i == 4:  # bottom right
                c = s + w0, s + hp, s + w0 + w, s + hp + h
            elif i == 5:  # bottom
                c = s + w0 - w, s + h0, s + w0, s + h0 + h
            elif i == 6:  # bottom left
                c = s + w0 - wp - w, s + h0, s + w0 - wp, s + h0 + h
            elif i == 7:  # left
                c = s - w, s + h0 - h, s, s + h0
            elif i == 8:  # top left
                c = s - w, s + h0 - hp - h, s, s + h0 - hp

            padx, pady = c[:2]
            x1, y1, x2, y2 = (max(x, 0) for x in c)  # allocate coords

            # Labels
            labels, segments = self.labels[index].copy(), self.segments[index].copy()
            if labels.size:
                labels[:, 1:] = xywhn2xyxy(labels[:, 1:], w, h, padx, pady)  # normalized xywh to pixel xyxy format
                segments = [xyn2xy(x, w, h, padx, pady) for x in segments]
            labels9.append(labels)
            segments9.extend(segments)

            # Image
            img9[y1:y2, x1:x2] = img[y1 - pady:, x1 - padx:]  # img9[ymin:ymax, xmin:xmax]
            hp, wp = h, w  # height, width previous

        # Offset
        yc, xc = (int(random.uniform(0, s)) for _ in self.mosaic_border)  # mosaic center x, y
        img9 = img9[yc:yc + 2 * s, xc:xc + 2 * s]

        # Concat/clip labels
        labels9 = np.concatenate(labels9, 0)
        labels9[:, [1, 3]] -= xc
        labels9[:, [2, 4]] -= yc
        c = np.array([xc, yc])  # centers
        segments9 = [x - c for x in segments9]

        for x in (labels9[:, 1:], *segments9):
            np.clip(x, 0, 2 * s, out=x)  # clip when using random_perspective()
        # img9, labels9 = replicate(img9, labels9)  # replicate

        # Augment
        img9, labels9 = random_perspective(img9,
                                           labels9,
                                           segments9,
                                           degrees=self.hyp['degrees'],
                                           translate=self.hyp['translate'],
                                           scale=self.hyp['scale'],
                                           shear=self.hyp['shear'],
                                           perspective=self.hyp['perspective'],
                                           border=self.mosaic_border)  # border to remove

        return img9, labels9

Switching between them

mosaic = self.mosaic and random.random() < hyp['mosaic']
if mosaic:
	# Load mosaic
	img, labels = load_mosaic(self, index)  # use load_mosaic4
	# img, labels = load_mosaic9(self, index)   # use load_mosaic9
	shapes = None
	
	# MixUp augmentation
	if random.random() < hyp['mixup']:
	    img, labels = mixup(img, labels, *load_mosaic(self, random.randint(0, self.n - 1)))
	    # img, labels = mixup(img, labels, *load_mosaic9(self, random.randint(0, self.n - 1)))

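1.1.4 Copy-paste data augmentation

Paper: Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation

As the name suggests, this method randomly pastes some objects into the image. It requires segments data, that is, instance segmentation information for every object.

On COCO instance segmentation, it reaches 49.1 mask AP and 57.3 box AP, an improvement of +0.6 mask AP and +1.5 box AP over the previous state of the art.

The copy-paste implementation in YOLOv5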

def copy_paste(im, labels, segments, p=0.5):
    # Implement Copy-Paste augmentation https://arxiv.org/abs/2012.07177, labels as nx5 np.array(cls, xyxy)
    n = len(segments)
    if p and n:
        h, w, c = im.shape  # height, width, channels
        im_new = np.zeros(im.shape, np.uint8)
        for j in random.sample(range(n), k=round(p * n)):
            l, s = labels[j], segments[j]
            box = w - l[3], l[2], w - l[1], l[4]
            ioa = bbox_ioa(box, labels[:, 1:5])  # intersection over area
            if (ioa < 0.30).all():  # allow 30% obscuration of existing labels
                labels = np.concatenate((labels, [[l[0], *box]]), 0)
                segments.append(np.concatenate((w - s[:, 0:1], s[:, 1:2]), 1))
                cv2.drawContours(im_new, [segments[j].astype(np.int32)], -1, (255, 255, 255), cv2.FILLED)

        result = cv2.bitwise_and(src1=im, src2=im_new)
        result = cv2.flip(result, 1)  # augment segments (flip left-right)
        i = result > 0  # pixels to replace
        # i[:, :] = result.max(2).reshape(h, w, 1)  # act over ch
        im[i] = result[i]  # cv2.imwrite('debug.jpg', im)  # debug

    return im, labels, segments

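1.1.5 Random affine transformation

In YOLOv5, the Mosaic augmentation code includes the affine transformation, and when Mosaic is not used the affine transformation is applied on its own. YOLOv5's transformation covers random rotation, translation, scaling, shear (shifting all points along a given direction in proportion to their position), and perspective. According to hyp.scratch-low.yaml, only scale and translation are enabled by default. The degrees hyperparameter sets the rotation angle, while perspective and shear set the perspective and shear transforms.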
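
To make the default case (scale plus translation) concrete, the sketch below composes the matrices by hand and applies them to one box. The helper names and the toy numbers are assumptions for this example; only the idea of chaining 3×3 homogeneous matrices from right to left and re-boxing the four warped corners is taken from the random_perspective listing.

```python
import numpy as np

def center_scale_translate(width, height, scale, tx, ty):
    """Build M = T @ R @ C: scale about the image center, then shift."""
    C = np.eye(3)
    C[0, 2] = -width / 2       # move the image center to the origin
    C[1, 2] = -height / 2
    R = np.eye(3)
    R[0, 0] = R[1, 1] = scale  # uniform scale, no rotation in this sketch
    T = np.eye(3)
    T[0, 2] = tx               # move the origin to its final position
    T[1, 2] = ty
    return T @ R @ C           # applied right to left: C first, T last

def warp_box(box, M):
    """Transform the four corners of (x1, y1, x2, y2) and take their bounding box."""
    x1, y1, x2, y2 = box
    corners = np.array([[x1, y1, 1], [x2, y2, 1], [x1, y2, 1], [x2, y1, 1]], dtype=float)
    xy = (corners @ M.T)[:, :2]
    return np.array([xy[:, 0].min(), xy[:, 1].min(), xy[:, 0].max(), xy[:, 1].max()])

# Shrink a 100x100 image to half size and keep it centered (tx = ty = 50).
M = center_scale_translate(width=100, height=100, scale=0.5, tx=50, ty=50)
new_box = warp_box((0, 0, 100, 100), M)
print(new_box)
```

The same matrix M is used for both the pixels and the labels, which is why the boxes stay aligned with the warped image.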

The random_perspective implementation in YOLOv5

def random_perspective(im,
                       targets=(),
                       segments=(),
                       degrees=10,
                       translate=.1,
                       scale=.1,
                       shear=10,
                       perspective=0.0,
                       border=(0, 0)):
    # torchvision.transforms.RandomAffine(degrees=(-10, 10), translate=(0.1, 0.1), scale=(0.9, 1.1), shear=(-10, 10))
    # targets = [cls, xyxy]

    height = im.shape[0] + border[0] * 2  # shape(h,w,c)
    width = im.shape[1] + border[1] * 2

    # Center
    C = np.eye(3)
    C[0, 2] = -im.shape[1] / 2  # x translation (pixels)
    C[1, 2] = -im.shape[0] / 2  # y translation (pixels)

    # Perspective
    P = np.eye(3)
    P[2, 0] = random.uniform(-perspective, perspective)  # x perspective (about y)
    P[2, 1] = random.uniform(-perspective, perspective)  # y perspective (about x)

    # Rotation and Scale
    R = np.eye(3)
    a = random.uniform(-degrees, degrees)
    # a += random.choice([-180, -90, 0, 90])  # add 90deg rotations to small rotations
    s = random.uniform(1 - scale, 1 + scale)
    # s = 2 ** random.uniform(-scale, scale)
    R[:2] = cv2.getRotationMatrix2D(angle=a, center=(0, 0), scale=s)

    # Shear
    S = np.eye(3)
    S[0, 1] = math.tan(random.uniform(-shear, shear) * math.pi / 180)  # x shear (deg)
    S[1, 0] = math.tan(random.uniform(-shear, shear) * math.pi / 180)  # y shear (deg)

    # Translation
    T = np.eye(3)
    T[0, 2] = random.uniform(0.5 - translate, 0.5 + translate) * width  # x translation (pixels)
    T[1, 2] = random.uniform(0.5 - translate, 0.5 + translate) * height  # y translation (pixels)

    # Combined rotation matrix
    M = T @ S @ R @ P @ C  # order of operations (right to left) is IMPORTANT
    if (border[0] != 0) or (border[1] != 0) or (M != np.eye(3)).any():  # image changed
        if perspective:
            im = cv2.warpPerspective(im, M, dsize=(width, height), borderValue=(114, 114, 114))  # perspective transform
        else:  # affine
            im = cv2.warpAffine(im, M[:2], dsize=(width, height), borderValue=(114, 114, 114))  # affine transform

    # Visualize
    # import matplotlib.pyplot as plt
    # ax = plt.subplots(1, 2, figsize=(12, 6))[1].ravel()
    # ax[0].imshow(im[:, :, ::-1])  # base
    # ax[1].imshow(im2[:, :, ::-1])  # warped
    
    # Transform label coordinates
    n = len(targets)
    if n:
        use_segments = any(x.any() for x in segments)
        new = np.zeros((n, 4))
        if use_segments:  # warp segments
            segments = resample_segments(segments)  # upsample
            for i, segment in enumerate(segments):
                xy = np.ones((len(segment), 3))
                xy[:, :2] = segment
                xy = xy @ M.T  # transform
                xy = xy[:, :2] / xy[:, 2:3] if perspective else xy[:, :2]  # perspective rescale or affine

                # clip
                new[i] = segment2box(xy, width, height)

        else:  # warp boxes
            xy = np.ones((n * 4, 3))
            xy[:, :2] = targets[:, [1, 2, 3, 4, 1, 4, 3, 2]].reshape(n * 4, 2)  # x1y1, x2y2, x1y2, x2y1
            xy = xy @ M.T  # transform
            xy = (xy[:, :2] / xy[:, 2:3] if perspective else xy[:, :2]).reshape(n, 8)  # perspective rescale or affine

            # create new boxes
            x = xy[:, [0, 2, 4, 6]]
            y = xy[:, [1, 3, 5, 7]]
            new = np.concatenate((x.min(1), y.min(1), x.max(1), y.max(1))).reshape(4, n).T

            # clip
            new[:, [0, 2]] = new[:, [0, 2]].clip(0, width)
            new[:, [1, 3]] = new[:, [1, 3]].clip(0, height)

        # filter candidates: screen the labels by area, width/height, and aspect ratio
        i = box_candidates(box1=targets[:, 1:5].T * s, box2=new.T, area_thr=0.01 if use_segments else 0.10)
        targets = targets[i]
        targets[:, 1:5] = new[i]

    return im, targets

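1.1.6 Random HSV augmentation

YOLOv5 uses HSV augmentation so that the model sees more varied data during training. The "variety" gained from HSV augmentation can be described from three angles:

Hue variety: randomly shifting the hue simulates input images with different color styles, such as different filters or differently colored lighting, and improves generalization in those scenarios.
Saturation variety: randomly adjusting the saturation improves generalization to objects of differing vividness.
Value (brightness) variety: randomly adjusting the brightness helps the model cope with input images taken under different lighting conditions.

HSV augmentation is very widely used when training object detection models. It enriches the dataset without destroying the key information in the image, and its computational cost is low, which makes it a very practical augmentation method.

The augment_hsv implementation in YOLOv5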

def augment_hsv(im, hgain=0.5, sgain=0.5, vgain=0.5):
    # HSV color-space augmentation
    if hgain or sgain or vgain:
        r = np.random.uniform(-1, 1, 3) * [hgain, sgain, vgain] + 1  # random gains
        hue, sat, val = cv2.split(cv2.cvtColor(im, cv2.COLOR_BGR2HSV))  # convert BGR to HSV, then split the three channels
        dtype = im.dtype  # uint8
        # build one lookup table per channel that maps each original value to its new value
        x = np.arange(0, 256, dtype=r.dtype)
        lut_hue = ((x * r[0]) % 180).astype(dtype)  # 8-bit hue values in OpenCV lie in [0, 180)
        lut_sat = np.clip(x * r[1], 0, 255).astype(dtype)
        lut_val = np.clip(x * r[2], 0, 255).astype(dtype)
        # map the H, S, and V channels to their randomly scaled values, then merge
        im_hsv = cv2.merge((cv2.LUT(hue, lut_hue), cv2.LUT(sat, lut_sat), cv2.LUT(val, lut_val)))
        cv2.cvtColor(im_hsv, cv2.COLOR_HSV2BGR, dst=im)  # no return needed

1.1.7 Random flips (up-down and left-right)

The flip implementation in YOLOv5

# Flip up-down
if random.random() < hyp['flipud']:
    img = np.flipud(img)
    if nl:
        labels[:, 2] = 1 - labels[:, 2]

# Flip left-right
if random.random() < hyp['fliplr']:
    img = np.fliplr(img)
    if nl:
        labels[:, 1] = 1 - labels[:, 1]
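
The label update in the excerpt works because the boxes are stored as normalized (cls, cx, cy, w, h): a flip only mirrors the box center, while width and height stay the same. The toy array and label values below are made up for illustration.

```python
import numpy as np

img = np.arange(12).reshape(2, 6)                # toy image, height 2 and width 6
labels = np.array([[0.0, 0.25, 0.5, 0.2, 0.4]])  # one box: cls, cx, cy, w, h (normalized)

# Left-right flip: mirror the columns and the x center.
img_lr = np.fliplr(img)
labels_lr = labels.copy()
labels_lr[:, 1] = 1 - labels_lr[:, 1]

# Up-down flip: mirror the rows and the y center.
img_ud = np.flipud(img)
labels_ud = labels.copy()
labels_ud[:, 2] = 1 - labels_ud[:, 2]

print(labels_lr, labels_ud)
```

If the labels were stored as pixel corner coordinates instead, both x1 and x2 would have to be mirrored and swapped, so the normalized center format keeps the flip code short.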

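1.1.8 The Albumentations data augmentation toolkit

The Albumentations toolkit covers the vast majority of data augmentation methods, and its usage resembles PyTorch's transforms. It offers more augmentations than the official PyTorch ones and is convenient to use.

GitHub: https://github.com/albumentations-team/albumentations
Documentation: https://albumentations.ai/docs

The Albumentations class in YOLOv5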

class Albumentations:
    # YOLOv5 Albumentations class (optional, only used if package is installed)
    def __init__(self):
        self.transform = None
        try:
            import albumentations as A
            check_version(A.__version__, '1.0.3', hard=True)  # version requirement

            T = [
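                A.Blur(p=0.01),                       # random blur
                A.MedianBlur(p=0.01),                 # blur the input image with a median filter
                A.ToGray(p=0.01),                     # convert the input RGB image to grayscale
                A.CLAHE(p=0.01),                      # contrast-limited adaptive histogram equalization
                A.RandomBrightnessContrast(p=0.0),    # randomly change brightness and contrast
                A.RandomGamma(p=0.0),                 # random gamma transform
                A.ImageCompression(quality_lower=75, p=0.0),  # degrade the image with JPEG/WebP compression

                # optional additions
                A.GaussianBlur(p=0.15),               # blur with a Gaussian filter
                A.GaussNoise(p=0.15),                 # add Gaussian noise to the input image
                A.FancyPCA(p=0.25),                   # PCA over the R/G/B channels, then randomly perturb pixel intensities along the principal components (AlexNet)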
            ]
            self.transform = A.Compose(T, bbox_params=A.BboxParams(format='yolo', label_fields=['class_labels']))

            LOGGER.info(colorstr('albumentations: ') + ', '.join(f'{x}' for x in self.transform.transforms if x.p))
        except ImportError:  # package not installed, skip
            pass
        except Exception as e:
            LOGGER.info(colorstr('albumentations: ') + f'{e}')

    def __call__(self, im, labels, p=1.0):
        if self.transform and random.random() < p:
            new = self.transform(image=im, bboxes=labels[:, 1:], class_labels=labels[:, 0])  # transformed
            im, labels = new['image'], np.array([[c, *b] for c, b in zip(new['class_labels'], new['bboxes'])])
        return im, labels

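1.2 Adaptive anchor computation

The anchors in YOLOv5 v7.0 were obtained by clustering on the COCO dataset. With a 640×640 input, the network produces outputs at three scales: 80×80 (640/8), 40×40 (640/16), and 20×20 (640/32). The 80×80 map is the shallow feature map (P3); it carries more low-level information and suits small objects, so its anchors are small. The 20×20 map is the deep feature map (P5); it carries more high-level information, such as contours and structure, and suits large objects, so its anchors are large. The 40×40 map (P4) uses anchors between these two scales to detect medium-sized objects. The 20×20 feature map is the input downsampled by a factor of 32, so its priors (116 × 90), (156 × 198), and (373 × 326), defined at the 640×640 scale, shrink by a factor of 32 to (3.625 × 2.8125), (4.875 × 6.1875), and (11.6563 × 10.1875). The map has 20×20 = 400 grid cells, each assigned 3 priors, for 3×20×20 = 1200 priors in total.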
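
The arithmetic for a 640×640 input can be checked with a few lines of plain Python. The variable names are chosen for this example; the strides, the number of anchors per cell, and the P5 anchor sizes come from the text.

```python
img_size = 640
strides = {"P3": 8, "P4": 16, "P5": 32}
anchors_per_cell = 3
p5_anchors_px = [(116, 90), (156, 198), (373, 326)]  # P5 anchors at the 640x640 scale

# Grid side length and number of prior boxes for each output scale.
grids = {name: img_size // s for name, s in strides.items()}
priors = {name: anchors_per_cell * g * g for name, g in grids.items()}
total_priors = sum(priors.values())

# P5 anchors expressed in grid-cell units (divide by the stride of 32).
p5_anchors_grid = [(w / 32, h / 32) for w, h in p5_anchors_px]

print(grids)            # grid sizes per scale
print(priors)           # prior boxes per scale
print(total_priors)     # prior boxes over all three scales
print(p5_anchors_grid)
```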

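In YOLOv3 and YOLOv4, the initial anchor values for a new dataset are computed by a separate program. YOLOv5 builds this into the training code: on each training run it adaptively computes the best anchors for the given training set. If the computed anchors do not work well, automatic anchor computation can also be turned off in train.py (the --noautoanchor option).

YOLOv5's adaptive anchor function kmean_anchors (in utils/autoanchor.py)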

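def kmean_anchors(path='./data/coco128.yaml', n=9, img_size=640, thr=4.0, gen=1000, verbose=True):
    """Called from check_anchors.
    Uses K-means plus a genetic algorithm to compute anchors that better fit the current dataset.
    Creates kmeans-evolved anchors from training dataset
    :params path: path to the dataset yaml, or the dataset object itself
    :params n: number of anchors
    :params img_size: agreed image size for the dataset
    :params thr: threshold, controlled by the hyp['anchor_t'] hyperparameter
    :params gen: number of genetic-algorithm generations (mutation + selection)
    :params verbose: whether to print every successful evolution step; the caller passes False by default, so only the best result is printed
    :return k: anchors after K-means and genetic evolution
    """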
    from scipy.cluster.vq import kmeans


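    # NOTE: this excerpt relies on module-level imports that are not shown (np, yaml, tqdm, colorstr);
    # `flow` is presumably OneFlow (import oneflow as flow), whose tensor API mirrors torch.
    # thr below is 1/thr rather than the value passed in, so the metric matches the one used in check_anchors
    thr = 1. / thr  # 0.25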
    prefix = colorstr('autoanchor: ')

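    def metric(k, wh):  # compute metrics
        """Used by print_results and anchor_fitness.
        Computes the ratio metric: the width and height ratios between every ground-truth box in the dataset
        and each anchor (gt_w / k_w, gt_h / k_h), returning x and best_x for the later BPR and aat calculations.
        Note that the metric is the width/height ratio between ground-truth boxes and anchors, not the usual IoU.
        This matches the ratio test YOLOv5 uses when assigning targets to anchors during training.
        :params k: anchor boxes
        :params wh: wh of the whole dataset [N, 2]
        :return x: [N, 9] for each of the N ground-truth boxes and every anchor, the smaller of the width and height ratios
        :return x.max(1)[0]: [N] for each ground-truth box, the largest such value over all anchors
        """
        # [N, 1, 2] / [1, 9, 2] = [N, 9, 2]: width and height ratios between N gt_wh and 9 anchor k_wh
        # the closer a ratio is to 1 the better the match; ratios far from 1 (<1 or >1) mean a poorer match
        r = wh[:, None] / k[None]
        # r = gt_height / anchor_height, gt_width / anchor_width; each may be greater than 1 or at most 1
        # flow.min(r, 1. / r): [N, 9, 2] folds every width and height ratio into the range <= 1
        # .min(2): value=[N, 9] the smaller of the width and height ratios for each gt/anchor pair; index: [N, 9] whether that minimum is the width ratio (0) or the height ratio (1)
        # [0] returns value [N, 9], the worse-matching of the two ratios for each gt/anchor pair
        x = flow.min(r, 1. / r).min(2)[0]  # ratio metric
        # x = wh_iou(wh, flow.tensor(k))  # IoU metric
        # x.max(1)[0]: [N] for each ground-truth box, the best ratio over all (9) anchors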
        return x, x.max(1)[0]  # x, best_x

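    def anchor_fitness(k):   # mutation fitness
        """Used by kmean_anchors.
        Fitness function for the genetic algorithm: it measures whether a mutation helped. If it did, the mutated
        anchors are selected; otherwise the next round of mutation begins.
        :params k: [9, 2] the 9 anchors produced by K-means     wh: [N, 2] width and height of every ground-truth box in the dataset
        :return (best * (best > thr).float()).mean(): the fitness value [1]; note that this is a custom formula and differs from BPR
                it is the fitness of the anchors k passed in
        """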
        _, best = metric(flow.tensor(k, dtype=flow.float32), wh)
        return (best * (best > thr).float()).mean()  # fitness

    def print_results(k):
        """Used by kmean_anchors to print K-means related information.
        Computes BPR and aat, then prints: threshold, BPR, aat, number of anchors, image size, metric_all, best_mean, past_mean, and the K-means anchors (rounded)
        :params k: anchors k produced by K-means
        :return k: input
        """
        # sort the K-means anchors k by area, smallest first
        k = k[np.argsort(k.prod(1))]
        # x: [N, 9] for each ground-truth box and every anchor, the smaller of the width and height ratios
        # best: [N] for each ground-truth box, the largest such value over all anchors
        x, best = metric(k, wh0)
        # (best > thr).float(): True -> 1., False -> 0.  .mean(): take the mean
        # BPR (best possible recall): ground-truth boxes that can be recalled (pass thr) / all ground-truth boxes  [1] 0.96223; K-means is only run when this is below 0.98
        # aat (anchors above threshold): [1] 3.54360, the average number of anchors per target
        BPR, aat = (best > thr).float().mean(), (x > thr).float().mean() * n  # best possible recall, anch > thr
        f = anchor_fitness(k)
        # print(f'{prefix}thr={thr:.2f}: {BPR:.4f} best possible recall, {aat:.2f} anchors past thr')
        # print(f'{prefix}n={n}, img_size={img_size}, metric_all={x.mean():.3f}/{best.mean():.3f}-mean/best, '
        #       f'past_thr={x[x > thr].mean():.3f}-mean: ', end='')
        print(f"aat: {aat:.5f}, fitness: {f:.5f}, best possible recall: {BPR:.5f}")
        for i, x in enumerate(k):
            print('%i,%i' % (round(x[0]), round(x[1])), end=',  ' if i < len(k) - 1 else '\n')  # use in *.cfg

        return k


    # load the dataset
    if isinstance(path, str):  # *.yaml file
        with open(path) as f:
            data_dict = yaml.safe_load(f)  # model dict
        from utils.datasets import LoadImagesAndLabels
        dataset = LoadImagesAndLabels(data_dict['train'], augment=True, rect=True)
    else:
        dataset = path  # dataset

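    # collect the wh of every label in the dataset
    # scale each image so its longest side equals img_size; the shorter side is scaled proportionally
    shapes = img_size * dataset.shapes / dataset.shapes.max(1, keepdims=True)
    # scale the normalized wh of the gt boxes to the shapes scale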
    wh0 = np.concatenate([l[:, 3:5] * s for s, l in zip(shapes, dataset.labels)])

    # count gt boxes whose width or height is under 3 pixels; warn that these objects are too small
    i = (wh0 < 3.0).any(1).sum()
    if i:
        print(f'{prefix}WARNING: Extremely small objects found. {i} of {len(wh0)} labels are < 3 pixels in size.')

    # keep labels with a side of at least 2 pixels for clustering; the [...] expression is a boolean mask and rows marked True are kept
    wh = wh0[(wh0 >= 2.0).any(1)]  # filter > 2 pixels
    # wh = wh * (np.random.rand(wh.shape[0], 1) * 0.9 + 0.1)  # multiply by random scale 0-1

    # K-means clustering, using Euclidean distance
    print(f'{prefix}Running kmeans for {n} anchors on {len(wh)} gt boxes...')
    # standard deviation of the widths and heights -> [w_std, h_std]
    s = wh.std(0)  # sigmas for whitening
    # cluster into n groups; the returned anchors k are in the whitened space
    # note that kmeans here measures Euclidean distance
    # K-means runs for 30 iterations; obs: the input must be whitened first ('whiten operation')
    # whitening: the new data has standard deviation 1, which lowers correlation between features, reduces redundancy in the information they carry, and improves training efficiency
    # blog post on whitening: https://blog.csdn.net/weixin_37872766/article/details/102957235
    k, dist = kmeans(wh / s, n, iter=30)  # points, mean distance
    assert len(k) == n, f'{prefix}ERROR: scipy.cluster.vq.kmeans requested {n} points but returned only {len(k)}'
    k *= s  # multiply by s to recover the anchors at the original (pre-whitening) scale

    wh = flow.tensor(wh, dtype=flow.float32)  # filtered wh
    wh0 = flow.tensor(wh0, dtype=flow.float32)  # unfiltered wh0

    # print information about the newly computed anchors k
    k = print_results(k)

    # Plot wh
    # k, d = [None] * 20, [None] * 20
    # for i in tqdm(range(1, 21)):
    #     k[i-1], d[i-1] = kmeans(wh / s, i)  # points, mean distance
    # fig, ax = plt.subplots(1, 2, figsize=(14, 7), tight_layout=True)
    # ax = ax.ravel()
    # ax[0].plot(np.arange(1, 21), np.array(d) ** 2, marker='.')
    # fig, ax = plt.subplots(1, 2, figsize=(14, 7))  # plot wh
    # ax[0].hist(wh[wh[:, 0]<100, 0], 400)
    # ax[1].hist(wh[wh[:, 1]<100, 1], 400)
    # fig.savefig('wh.png', dpi=200)

    # Evolve: mutation step, similar to a genetic/evolutionary algorithm
    npr = np.random   # random number utilities
    # f: fitness 0.62690
    # sh: (9,2)
    # mp: mutation probability = 0.9   s: sigma = 0.1
    f, sh, mp, s = anchor_fitness(k), k.shape, 0.9, 0.1  # fitness, shape, mutation prob, sigma
    pbar = tqdm(range(gen), desc=f'{prefix}Evolving anchors with Genetic Algorithm:')  # progress bar
    # evolve new anchors from the n clustered points with the genetic algorithm
    for _ in pbar:
        # repeat mutation + selection gen (1000) times, keeping the best anchors k and the best fitness f
        v = np.ones(sh)  # v [9, 2], all ones
        while (v == 1).all():
            # build the mutation factors; mutate until a change occurs (prevent duplicates)
            # npr.random(sh) < mp: each entry mutates with probability 90%; selected entries are 1, the rest 0
            v = ((npr.random(sh) < mp) * npr.random() * npr.randn(*sh) * s + 1).clip(0.3, 3.0)
        # mutate the anchors k that hold the best fitness so far
        kg = (k.copy() * v).clip(min=2.0)
        # fitness of the mutated anchors kg
        fg = anchor_fitness(kg)
        # if the mutated anchors kg beat the best fitness so far, select them
        if fg > f:
            # kg becomes the best anchors k and fg the best fitness f
            f, k = fg, kg.copy()

            # print progress
            pbar.desc = f'{prefix}Evolving anchors with Genetic Algorithm: fitness = {f:.4f}'
            if verbose:
                print_results(k)
    return print_results(k)
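
The ratio metric, BPR, aat, and fitness used in the listing can be reproduced on toy data with NumPy alone. This is a simplified sketch: the function name ratio_stats and the sample boxes are assumptions for illustration, and NumPy replaces the tensor library used in the listing.

```python
import numpy as np

def ratio_stats(anchors, wh, thr=4.0):
    """Return (bpr, aat, fitness) for anchors [A, 2] and ground-truth sizes wh [N, 2]."""
    inv_thr = 1.0 / thr
    r = wh[:, None] / anchors[None]           # [N, A, 2] width and height ratios
    x = np.minimum(r, 1.0 / r).min(axis=2)    # [N, A] worse of the two ratios, folded to <= 1
    best = x.max(axis=1)                      # [N] best-matching anchor for each box
    bpr = (best > inv_thr).mean()             # share of boxes that at least one anchor can match
    aat = (x > inv_thr).sum(axis=1).mean()    # average number of anchors above threshold per box
    fitness = (best * (best > inv_thr)).mean()
    return bpr, aat, fitness

anchors = np.array([[10.0, 10.0], [100.0, 100.0]])
wh = np.array([[20.0, 20.0],     # matches the small anchor (ratio 0.5)
               [5.0, 50.0],      # too elongated for either anchor (best ratio 0.2)
               [300.0, 300.0]])  # matches the large anchor (ratio 1/3)
bpr, aat, fitness = ratio_stats(anchors, wh)
print(bpr, aat, fitness)
```

The elongated box shows why the metric takes the worse of the two ratios: a good width match does not help if the height is off by a factor of five.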

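1.3 Adaptive image scaling

In common object detection algorithms, input images come in different sizes, so the usual approach is to scale and pad each original image to a standard size before feeding it to the detector. In real projects many images have different aspect ratios, so the amount of padding at the two ends differs from image to image after scaling; heavy padding carries redundant information and slows inference. The letterbox function in YOLOv5's utils/augmentations.py therefore adds only the minimum padding each image needs.

YOLOv5's letterbox function (utils/augmentations.py)

Suppose the original image is (1080, 1920) and the target size is (640, 640). The candidate scale factors are 640/1080 = 0.59 and 640/1920 = 0.33; the smaller one, 0.33, is chosen, so the image is scaled to (360, 640). Next, the 360 side is padded with gray until it is divisible by 32, which gives 384, so the final image size is (384, 640).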
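
The worked example can be verified with a small shape-only calculation. The function letterbox_shape is a hypothetical helper written for this check; it reproduces only the size arithmetic of the auto=True path and does not touch any pixels.

```python
def letterbox_shape(h, w, new_shape=(640, 640), stride=32):
    """Return (scaled size, padding, final size), each as (height, width)."""
    r = min(new_shape[0] / h, new_shape[1] / w)        # smaller ratio keeps the whole image
    new_h, new_w = int(round(h * r)), int(round(w * r))
    dh = (new_shape[0] - new_h) % stride               # minimum padding to reach a stride multiple
    dw = (new_shape[1] - new_w) % stride
    return (new_h, new_w), (dh, dw), (new_h + dh, new_w + dw)

scaled, pad, final = letterbox_shape(1080, 1920)
print(scaled, pad, final)
```

Padding to (384, 640) instead of the full (640, 640) removes 256 rows of gray border, which is where the inference speedup comes from.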

def letterbox(im, new_shape=(640, 640), color=(114, 114, 114), auto=True, scaleFill=False, scaleup=True, stride=32):
    # Resize and pad image while meeting stride-multiple constraints
    shape = im.shape[:2]  # current shape [height, width]
    if isinstance(new_shape, int):
        new_shape = (new_shape, new_shape)  # target size after scaling

    # Scale ratio (new / old): take the smaller of the height and width ratios
    r = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
    if not scaleup:  # only scale down, do not scale up (for better val mAP)
        r = min(r, 1.0)

    # Compute padding
    ratio = r, r  # width, height ratios
    new_unpad = int(round(shape[1] * r)), int(round(shape[0] * r))  # width and height after plain scaling
    dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]  # wh padding: amount of gray border to add
    if auto:  # minimum rectangle: adaptive scaling pads only enough for width and height to be divisible by stride
        dw, dh = np.mod(dw, stride), np.mod(dh, stride)  # wh padding, remainder via np.mod
    elif scaleFill:  # stretch: no adaptive scaling, resize straight to the target shape without padding
        dw, dh = 0.0, 0.0
        new_unpad = (new_shape[1], new_shape[0])
        ratio = new_shape[1] / shape[1], new_shape[0] / shape[0]  # width, height ratios

    dw /= 2  # divide padding into 2 sides
    dh /= 2  # half of the padding goes on each side (top/bottom and left/right)

    if shape[::-1] != new_unpad:  # resize
        im = cv2.resize(im, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))  # padding for the top and bottom
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))  # padding for the left and right
    im = cv2.copyMakeBorder(im, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)  # add border, filled with 114
    return im, ratio, (dw, dh)

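2 Backbone

The YOLOv1 backbone has 24 convolutional layers and 2 fully connected layers. It uses the Leaky ReLU activation and has no BN layers.
The YOLOv2 backbone builds on YOLOv1 with the Darknet-19 network, which has 19 convolutional layers and adds BN layers to improve overall performance.
YOLOv3 deepens YOLOv2's Darknet-19 and adopts the residual idea from ResNet, which is what allowed the backbone to grow substantially into Darknet-53.
The YOLOv4 backbone builds on YOLOv3 and, inspired by CSPNet, combines several CSP submodules into CSPDarknet53. It uses the Mish activation (the rest of the network still uses LeakyReLU). CSPDarknet53 has 72 convolutional layers in total; in keeping with the YOLO series, its downsampling convolutions are 3×3 with stride 2, so they both extract features and progressively downsample.
The YOLOv5 backbone also uses the CSP idea from YOLOv4. The earliest YOLOv5 versions had a Focus structure; from version 6 onward it was dropped in favor of an ordinary convolution, which has fewer parameters and performs better.

2.1 CSP

The core idea of CSP is to split the input feature map into two parts. One part passes through a small convolutional network (the sub-network), while the other goes straight on to the next stage; the two parts are then concatenated to form the input of the next layer. Both YOLOv4 and YOLOv5 use CSP, but YOLOv4 uses it only in the backbone, whereas YOLOv5 has two CSP variants. Taking YOLOv5s as an example, CSP1_X is used in the backbone and CSP2_X in the neck. The residual unit consists of two CBL blocks, so the difference between the two CSP variants is whether there is a shortcut (set by the shortcut parameter of the BottleneckCSP class).

In YOLOv5 v4.0, the author replaced the BottleneckCSP module with the C3 module, removing the Conv module that followed the residual output. C3 contains 3 standard convolution layers and several Bottleneck modules (their number is determined by the product of n and depth_multiple in the .yaml config file), and the activation in the standard convolution after the concat changed from LeakyReLU to SiLU.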
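
As a rough sketch of how the repeat count follows from n and depth_multiple, the snippet below rounds the product and keeps at least one block. The rounding rule, the per-stage n values, and the depth_multiple of 0.33 are assumptions used for illustration; check the model's .yaml file and model-building code for the values your version actually uses.

```python
def scaled_repeats(n, depth_multiple):
    """Number of Bottleneck repeats after depth scaling (at least 1 whenever n > 1)."""
    return max(round(n * depth_multiple), 1) if n > 1 else n

base_repeats = [3, 6, 9, 3]   # assumed per-stage n values, for illustration only
repeats_small = [scaled_repeats(n, 0.33) for n in base_repeats]
repeats_full = [scaled_repeats(n, 1.0) for n in base_repeats]
print(repeats_small, repeats_full)
```

Scaling the repeat count this way lets one config file describe a whole family of models that differ only in depth and width multipliers.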

The C3 class in YOLOv5

class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))

class C3(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)   # Conv = conv+BN+SiLU
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.Sequential(*(Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)))  # n residual units in series
        # self.m = nn.Sequential(*(CrossConv(c_, c_, 3, 1, g, 1.0, shortcut) for _ in range(n)))

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), 1))
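To see the two-branch idea run end to end, here is a minimal self-contained sketch. ConvBNSiLU and TinyCSP are hypothetical stand-ins written for this example (they are not the repository's Conv and C3 classes), and the channel and feature-map sizes are arbitrary.

```python
import torch
import torch.nn as nn


class ConvBNSiLU(nn.Sequential):
    # Hypothetical stand-in for the Conv block: conv + BN + SiLU with "same" padding.
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__(
            nn.Conv2d(c1, c2, k, s, padding=k // 2, bias=False),
            nn.BatchNorm2d(c2),
            nn.SiLU(),
        )


class TinyCSP(nn.Module):
    # Two parallel 1x1 branches; only branch_a runs through the residual units.
    def __init__(self, c1, c2, n=1):
        super().__init__()
        c_ = c2 // 2  # hidden channels
        self.branch_a = ConvBNSiLU(c1, c_)
        self.branch_b = ConvBNSiLU(c1, c_)
        self.units = nn.ModuleList(
            [nn.Sequential(ConvBNSiLU(c_, c_, 1), ConvBNSiLU(c_, c_, 3)) for _ in range(n)]
        )
        self.fuse = ConvBNSiLU(2 * c_, c2)

    def forward(self, x):
        a = self.branch_a(x)
        for unit in self.units:
            a = a + unit(a)  # residual unit with shortcut
        # Concatenate the processed branch with the bypass branch, then fuse.
        return self.fuse(torch.cat((a, self.branch_b(x)), dim=1))


block = TinyCSP(64, 128, n=2).eval()
with torch.no_grad():
    out = block(torch.randn(1, 64, 40, 40))
print(out.shape)
```

The spatial size is unchanged and only the channel count moves from c1 to c2, which is why such a block can be dropped between downsampling convolutions.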

3 Neck

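yolov1 and yolov2 have no neck module; it first appears in yolov3. The neck fuses features from different layers so that large, medium and small objects can all be detected.
The yolov3 neck adopts the FPN idea, with modifications to the original FPN.
The yolov4 neck mainly consists of an SPP module and a PAN module. SPP stands for spatial pyramid pooling. It was originally proposed to handle inputs of arbitrary size; in YOLOv4 it is used to enlarge the network's receptive field.
The yolov5 neck also uses SPP and PAN modules, but after the PAN fusion the CBL modules used in YOLOv4 are replaced by the CSP_v5 structure, which borrows from CSPNet, to strengthen feature fusion.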

3.1 SPP/SPPF

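Kaiming He et al. proposed spatial pyramid pooling (SPP) in 2014. It converts a feature map of arbitrary size into a fixed-size feature vector. In Yolov5, SPP serves a different purpose: it pools the same feature map at several scales. Max pooling with several kernel sizes (stride 1, so the spatial size is unchanged) is applied in parallel and the results are concatenated, which mixes information from different receptive fields and improves accuracy. The YOLOv5 implementation has two versions, SPP ("Spatial Pyramid Pooling") and SPPF ("Spatial Pyramid Pooling - Fast"). They serve the same purpose and differ only slightly in structure. After the change from SPP to SPPF (Yolov5 6.0), the module needs less computation and runs faster. The structure is shown in the figure below, where Conv denotes CBS = conv+BN+SiLU.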

The SPP/SPPF classes in YOLOv5

class SPP(nn.Module):
    # Spatial Pyramid Pooling (SPP) layer https://arxiv.org/abs/1406.4729
    def __init__(self, c1, c2, k=(5, 9, 13)):  # k: max-pool kernel sizes, 5, 9 and 13 by default
        super().__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)          # halve the channels
        self.cv2 = Conv(c_ * (len(k) + 1), c2, 1, 1)  # CBS applied after the concat
        self.m = nn.ModuleList([nn.MaxPool2d(kernel_size=x, stride=1, padding=x // 2) for x in k])

    def forward(self, x):
        x = self.cv1(x)
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')  # suppress torch 1.9.0 max_pool2d() warning
            return self.cv2(torch.cat([x] + [m(x) for m in self.m], 1))


class SPPF(nn.Module):
    # Spatial Pyramid Pooling - Fast (SPPF) layer for YOLOv5 by Glenn Jocher
    def __init__(self, c1, c2, k=5):  # equivalent to SPP(k=(5, 9, 13))
        super().__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * 4, c2, 1, 1)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')  # suppress torch 1.9.0 max_pool2d() warning
            y1 = self.m(x)
            y2 = self.m(y1)  # chaining k=5 pools reproduces the 9 and 13 pools, so the result is equivalent but faster
            return self.cv2(torch.cat((x, y1, y2, self.m(y2)), 1))
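The claim that chained 5*5 pools reproduce the 9*9 and 13*13 pools can be checked directly. The sketch below uses only nn.MaxPool2d on a random tensor; the tensor shape is an arbitrary assumption.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 8, 20, 20)  # arbitrary feature map

pool5 = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
pool9 = nn.MaxPool2d(kernel_size=9, stride=1, padding=4)
pool13 = nn.MaxPool2d(kernel_size=13, stride=1, padding=6)

y1 = pool5(x)    # 5*5 window
y2 = pool5(y1)   # two chained 5*5 windows cover a 9*9 window
y3 = pool5(y2)   # three chained 5*5 windows cover a 13*13 window

same9 = torch.equal(y2, pool9(x))
same13 = torch.equal(y3, pool13(x))
print(same9, same13)
```

Because a maximum of maxima is still a maximum, the chained version gives identical values while each pooling step only scans a 5*5 window, which is where the speed-up comes from.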

3.2 PAN

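Paper: Path Aggregation Network for Instance Segmentation

PANet is a 2018 work from The Chinese University of Hong Kong. It took first place in instance segmentation and second place in object detection in the COCO 2017 challenge. By studying Mask R-CNN, the authors found that low-level features have difficulty reaching the high levels, so they designed bottom-up path augmentation, shown in (b) of the figure below; (c) is adaptive feature pooling. The red line shows the path that low-level image features take through FPN, which passes through more than 100 layers. The green line shows their path through PANet, which passes through fewer than 10 layers.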

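The PAN structure in Yolov5

The FPN layers pass strong semantic features from top to bottom. (High-level semantic features are what remains after deep feature extraction: they have a large receptive field and are abstract, which helps classification, but they lose detail, which hurts precise segmentation.) PAN, in turn, passes strong localization features from bottom to top. Working together, the two aggregate parameters from different backbone levels for different detection levels. In the original PANet, two feature maps are combined with a shortcut (element-wise addition), whereas Yolov4/5 use concat, so the channel count of the fused feature map changes.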
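The difference between the two fusion styles shows up in the tensor shapes. A minimal sketch, assuming two feature maps with 256 channels each and nearest-neighbour upsampling by a factor of 2 (both choices are illustrative):

```python
import torch
import torch.nn.functional as F

deep = torch.randn(1, 256, 20, 20)     # coarse, semantically strong map
shallow = torch.randn(1, 256, 40, 40)  # finer map from an earlier stage

up = F.interpolate(deep, scale_factor=2, mode="nearest")  # 20*20 -> 40*40

fused_add = shallow + up                     # shortcut-style fusion: channels stay at 256
fused_cat = torch.cat((shallow, up), dim=1)  # concat fusion: channels become 512
print(fused_add.shape, fused_cat.shape)
```

Addition requires matching channel counts and keeps them, while concat doubles the channels here, so a convolution block after the concat is needed to bring the width back down.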

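4 Output

4.1 Positive sample selection

What are positive and negative samples?

Positive and negative samples both refer to boxes that the algorithm generates for computing the loss; the concept does not exist during inference or validation. Positive samples pull the prediction toward the ground truth, and negative samples push it away from everything that is not the ground truth. A positive-to-negative ratio of roughly 1:1 to 1:2 is preferable. The counts should not differ too much, especially when positives are scarce to begin with. If negatives far outnumber positives, their loss swamps that of the positives, which slows convergence and lowers detection accuracy. This is the common positive/negative imbalance problem in object detection, and one remedy is to increase the number of positive samples.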

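yolov5 increases the number of positive samples in three ways:
(1) Cross-anchor prediction
Suppose a GT box falls in some grid cell of some prediction branch. That cell has 3 anchors of different sizes. If the GT matches more than one of them, every matching anchor is used to predict the GT box, so one GT box can be predicted by several anchors. The predicted width and height are expressed relative to the anchor, and the predicted ratio has a limited range, 0-4. If the ratio between the label's true width/height and the anchor's width/height exceeds 4, the prediction cannot succeed. Which anchors match which labels is therefore decided by whether the ratio between the anchor width (height) and the label width (height) exceeds 4 (the anchor_t hyperparameter); if it does, they do not match. Note that the ratio is checked in both directions: label width / anchor width > 4 is a mismatch, and so is anchor width / label width > 4.

(2) Cross-grid prediction

Suppose a GT box falls in some grid cell of some prediction branch. The cell has 4 neighbours: left, top, right and bottom. Based on the position of the GT center, the 2 nearest neighbours are also used as prediction cells, so one GT box can be predicted by 3 cells. There are 5 cases (if the center of the label box falls exactly in the middle of the cell, only that cell is used):

(3) Cross-branch prediction
Suppose a GT box matches anchors on 2 or even 3 prediction branches. Then all of those branches predict the GT box, so one GT box can be predicted by several branches. Repeating the anchor matching and grid matching steps on each branch yields all positive samples matched to a given GT.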
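The two-way ratio test from rule (1) can be sketched on its own. The anchor sizes and label sizes below are illustrative values in grid units, and the threshold of 4.0 is assumed to be the anchor_t setting.

```python
import torch

anchor_t = 4.0  # assumed threshold
anchors = torch.tensor([[1.25, 1.625], [2.0, 3.75], [4.125, 2.875]])  # 3 anchors (w, h), grid units
gt_wh = torch.tensor([[3.0, 5.0], [12.0, 2.0]])                       # 2 labels (w, h), grid units

r = gt_wh[None] / anchors[:, None]     # (3, 2, 2): label size / anchor size
worst = torch.max(r, 1 / r).max(2)[0]  # (3, 2): worse of the two directions, over w and h
matched = worst < anchor_t             # True where the anchor may predict the label
print(matched)
```

Here the first label matches all three anchors, so it yields three positive samples on this layer, while the very wide second label only matches the largest anchor.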

Positive sample matching in yolov5, i.e. finding all positive samples that correspond to targets

    def build_targets(self, p, targets):
        # Build targets for compute_loss(), input targets(image,class,x,y,w,h)
        na, nt = self.na, targets.shape[0]  # na: number of anchors per layer, nt: number of targets
        tcls, tbox, indices, anch = [], [], [], []
        gain = torch.ones(7, device=self.device)  # normalized to gridspace gain
        ai = torch.arange(na, device=self.device).float().view(na, 1).repeat(1, nt)  # ai.shape = (na, nt): anchor indices, repeated nt times along the second dimension
        targets = torch.cat((targets.repeat(na, 1, 1), ai[..., None]), 2)  # targets.shape = (na, nt, 7): append the anchor index to every target

        g = 0.5  # bias
        off = torch.tensor(
            [
                [0, 0],
                [1, 0],
                [0, 1],
                [-1, 0],
                [0, -1],  # j,k,l,m
                # [1, 1], [1, -1], [-1, 1], [-1, -1],  # jk,jm,lk,lm
            ],
            device=self.device).float() * g  # offsets

        for i in range(self.nl):  # self.nl is the number of prediction layers (detection heads); anchor matching is done layer by layer
            anchors = self.anchors[i]  # the three anchor sizes of this prediction layer
            gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # e.g. on P3: gain=tensor([ 1.,  1., 80., 80., 80., 80.,  1.], device='cuda:0')

            # Match targets to anchors
            t = targets * gain  # shape(3,n,7): multiply the normalized gt boxes by the feature-map size to project them onto the feature map
            if nt:
                # Matches
                r = t[..., 4:6] / anchors[:, None]  # width/height ratio between the label boxes and this layer's anchors: wb/wa, hb/ha
                j = torch.max(r, 1 / r).max(2)[0] < self.hyp['anchor_t']  # compare the worse ratio with the anchor_t threshold: True if within it, otherwise False
                # j = wh_iou(anchors, t[:, 4:6]) > model.hyp['iou_t']  # iou(3,n)=wh_iou(anchors(3,2), gwh(n,2))
                t = t[j]  # keep only the targets that pass

                # Offsets
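                gxy = t[:, 2:4]  # center relative to the top-left corner of the grid; say one gt center is gxy=[22.20, 19.05]
                gxi = gain[[2, 3]] - gxy  # center relative to the bottom-right corner; here gxi=[17.80, 20.95] (on a 40*40 grid)
                j, k = ((gxy % 1 < g) & (gxy > 1)).T  # j, k: is the center closer to the left / top edge of its cell? % 1 gives the fractional part; both are below g=0.5, so j and k are True
                l, m = ((gxi % 1 < g) & (gxi > 1)).T  # l, m: is the center closer to the right / bottom edge? both are False, so this target's center lies toward the top-left
                j = torch.stack((torch.ones_like(j), j, k, l, m))  # the cell itself is always True, followed by the left, top, right and bottom neighbours
                t = t.repeat((5, 1, 1))[j]  # make 5 copies of t, then filter them with j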
                offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]
            else:
                t = targets[0]
                offsets = 0

            # Define
            bc, gxy, gwh, a = t.chunk(4, 1)  # (image, class), grid xy, grid wh, anchors
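            a, (b, c) = a.long().view(-1), bc.long().T  # anchors, image, class; a is the index of the anchor in this layer that the gt box matched
            gij = (gxy - offsets).long()  # .long() truncates; gij is the integer cell index after the offset is applied
            gi, gj = gij.T  # grid indices: (gi, gj) is the cell responsible for predicting this gt box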

            # Append
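            # indices records where each positive sample lives: b is the image index within the batch, a is the matched anchor index, gj is the row (y) and gi the column (x) of the responsible cell
            indices.append((b, a, gj.clamp_(0, gain[3] - 1), gi.clamp_(0, gain[2] - 1)))  # image, anchor, grid indices
            # tbox, anch and tcls hold the positive samples' own data
            tbox.append(torch.cat((gxy - gij, gwh), 1))  # offset of the center within the cell, plus width and height
            anch.append(anchors[a])  # anchor matched to each positive sample
            tcls.append(c)  # class of each positive sample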

        return tcls, tbox, indices, anch
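The cross-grid step is easier to follow for a single center point. The helper below is a plain-Python sketch of the same test (the name cells_for_center is hypothetical and it ignores batching and anchors); it returns the cells, as (gi, gj), that would be assigned to one GT center.

```python
def cells_for_center(gx, gy, grid_w, grid_h, g=0.5):
    # Return the (gi, gj) cells that predict a target centered at (gx, gy), in grid units.
    cells = [(int(gx), int(gy))]               # the cell containing the center
    if gx % 1 < g and gx > 1:
        cells.append((int(gx - g), int(gy)))   # center in the left half: add the left neighbour
    if gy % 1 < g and gy > 1:
        cells.append((int(gx), int(gy - g)))   # center in the top half: add the top neighbour
    ix, iy = grid_w - gx, grid_h - gy          # distances measured from the bottom-right corner
    if ix % 1 < g and ix > 1:
        cells.append((int(gx + g), int(gy)))   # center in the right half: add the right neighbour
    if iy % 1 < g and iy > 1:
        cells.append((int(gx), int(gy + g)))   # center in the bottom half: add the bottom neighbour
    return cells


top_left = cells_for_center(22.20, 19.05, 40, 40)
bottom_right = cells_for_center(22.7, 19.8, 40, 40)
exact_middle = cells_for_center(22.5, 19.5, 40, 40)
print(top_left, bottom_right, exact_middle)
```

For the center [22.20, 19.05] used in the code comments, the result is the cell (22, 19) plus its left and top neighbours. A center exactly in the middle of a cell keeps only that cell.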

4.2 Loss computation

Official Yolov5 documentation: https://docs.ultralytics.com/yolov5/tutorials/architecture_description/?h=loss

The loss function is called in train.py, as follows

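pre: for a 640*640 input, the network produces 3*(20*20+40*40+80*80) anchor-based predictions across the three feature maps. Each prediction carries px, py, pw, ph, po and the class scores pcls, i.e. 5 + nc values.

targets: all objects in a batch (with mosaic augmentation enabled, each image contains the objects of several original images). Each object has the 6 values (image,class,x,y,w,h), so shape=[num, 6].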

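The loss has three parts: (1) classification loss Lcls (BCE loss) (2) confidence loss Lobj (BCE loss) (3) box loss Lloc (CIOU loss)

The confidence loss is weighted differently on the three prediction layers (P3, P4, P5): [4.0, 1.0, 0.4] respectively.

The weights of the three loss terms are configurable. The defaults are in data/hyps/hyp.scratch-low.yaml, as shown in the figure below.

These three loss weights are scaled according to the number of classes, the image size and the number of detection layers.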
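The scaling can be sketched as a small function. The reference values it scales against (80 classes, 640 pixels, 3 layers) and the default gains passed in below (box 0.05, cls 0.5, obj 1.0) are assumptions based on common YOLOv5 settings; verify them against your own train.py and hyperparameter file.

```python
def scale_loss_gains(box, cls, obj, nc, imgsz, nl):
    # Rescale the three loss gains for the class count, image size and number of detection layers.
    box = box * 3 / nl                        # scale to layers
    cls = cls * nc / 80 * 3 / nl              # scale to classes and layers
    obj = obj * (imgsz / 640) ** 2 * 3 / nl   # scale to image size and layers
    return box, cls, obj


# Reference setup: gains stay at their configured values.
ref = scale_loss_gains(0.05, 0.5, 1.0, nc=80, imgsz=640, nl=3)

# A 4-class dataset trained at 1280 pixels: cls shrinks, obj grows.
custom = scale_loss_gains(0.05, 0.5, 1.0, nc=4, imgsz=1280, nl=3)
print(ref, custom)
```

With fewer classes the classification term is weighted down, and with larger images the confidence term is weighted up because there are four times as many cells.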

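4.2.1 Classification loss

At 640*640 resolution with 3 output layers, P3 has 80*80 cells, P4 has 40*40 and P5 has 20*20, 8400 cells in total. The classification loss is not computed on every cell's output, only on the cells responsible for predicting an object (the same holds for the box loss).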
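The cell count is simple arithmetic. A short sketch, assuming strides of 8, 16 and 32 for P3, P4 and P5 and 3 anchors per cell:

```python
imgsz = 640
strides = (8, 16, 32)  # assumed strides of P3, P4, P5
na = 3                 # assumed anchors per cell

grids = [imgsz // s for s in strides]  # grid side length per layer
cells = sum(g * g for g in grids)      # total number of cells
predictions = na * cells               # total number of predicted boxes
print(grids, cells, predictions)
```

So the network emits 25200 boxes per image, but only the few that were selected as positive samples contribute to the classification and box losses.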

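The classification loss uses nn.BCEWithLogitsLoss, i.e. binary cross-entropy. Suppose there are 4 classes (cat, dog, pig, chicken) and the ground-truth label is pig. The target is then [0, 0, 1, 0], and the classification part of the prediction also has 4 values, one logit per class (the sigmoid inside the loss turns them into probabilities). This amounts to computing 4 binary losses and taking their mean. The target values need not be exactly 0 or 1, because label smoothing can be applied.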

# Classification
if self.nc > 1:  # cls loss (only if multiple classes)
    t = torch.full_like(pcls, self.cn, device=self.device)  # torch.full_like returns a tensor shaped like pcls and filled with self.cn
    t[range(n), tcls[i]] = self.cp  # self.cp at the true class, self.cn everywhere else
    lcls += self.BCEcls(pcls, t)  # BCE
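The four-class example can be worked through numerically. The logits below are made up, and smooth_bce is a small helper written for this sketch that returns the positive and negative target values for a given smoothing factor.

```python
import torch
import torch.nn as nn


def smooth_bce(eps=0.1):
    # Positive and negative target values under label smoothing.
    return 1.0 - 0.5 * eps, 0.5 * eps


classes = ["cat", "dog", "pig", "chicken"]
logits = torch.tensor([[-2.0, -1.0, 3.0, -3.0]])  # hypothetical raw class outputs of one positive sample
tcls = torch.tensor([2])                          # ground truth: pig

cp, cn = smooth_bce(eps=0.0)                      # no smoothing: targets are exactly 1 and 0
t = torch.full_like(logits, cn)
t[range(len(tcls)), tcls] = cp                    # one-hot style target [0, 0, 1, 0]
loss = nn.BCEWithLogitsLoss()(logits, t)

# The same value by hand: the mean of four binary cross-entropies.
p = torch.sigmoid(logits)
manual = -(t * torch.log(p) + (1 - t) * torch.log(1 - p)).mean()
print(t, loss.item(), manual.item())
```

Calling smooth_bce(eps=0.1) instead would give targets of 0.95 and 0.05, which is what "the target need not be 0 or 1" means in practice.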

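4.2.2 Confidence loss

The confidence loss is computed on every cell. Its target value is not fixed: if the cell is responsible for predicting an object, the target is the IOU between the predicted box and the label box; if it is not responsible for any object, the target is 0.

Compared with earlier versions of YOLO, the YOLOv5 architecture changes how boxes are predicted. YOLOv2 and YOLOv3 predict the box coordinates directly from the activations of the last layer, as shown in the figure below.

In YOLOv5 the formula for the box coordinates has been updated to reduce grid sensitivity and to prevent the model from predicting unbounded box sizes. The revised formula for the predicted bounding box is as follows: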
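Written out, the two parameterizations are commonly given as below. This is a reconstruction that agrees with the pxy and pwh lines of the code in this section; the symbol names are assumptions: t is the raw network output, (c_x, c_y) is the top-left corner of the cell, (p_w, p_h) is the anchor size and σ is the sigmoid.

```latex
% YOLOv2 / YOLOv3
b_x = \sigma(t_x) + c_x, \qquad
b_y = \sigma(t_y) + c_y, \qquad
b_w = p_w \, e^{t_w}, \qquad
b_h = p_h \, e^{t_h}

% YOLOv5
b_x = \bigl(2\sigma(t_x) - 0.5\bigr) + c_x, \qquad
b_y = \bigl(2\sigma(t_y) - 0.5\bigr) + c_y, \qquad
b_w = p_w \bigl(2\sigma(t_w)\bigr)^2, \qquad
b_h = p_h \bigl(2\sigma(t_h)\bigr)^2
```

The loss code works with coordinates relative to the cell, so it computes the offset 2σ(t) − 0.5 without adding c_x and c_y.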

Computing the predicted box coordinates in Yolov5, and their IoU with the target

                # pxy, pwh, _, pcls = pi[b, a, gj, gi].tensor_split((2, 4, 5), dim=1)  # faster, requires torch 1.8.0
                pxy, pwh, _, pcls = pi[b, a, gj, gi].split((2, 2, 1, self.nc), 1)  # target-subset of predictions

                # Regression
                pxy = pxy.sigmoid() * 2 - 0.5
                pwh = (pwh.sigmoid() * 2) ** 2 * anchors[i]
                pbox = torch.cat((pxy, pwh), 1)  # predicted box
                iou = bbox_iou(pbox, tbox[i], CIoU=True).squeeze()  # iou(prediction, target)
                lbox += (1.0 - iou).mean()  # iou loss

                # Objectness
                iou = iou.detach().clamp(0).type(tobj.dtype)
                if self.sort_obj_iou:
                    j = iou.argsort()
                    b, a, gj, gi, iou = b[j], a[j], gj[j], gi[j], iou[j]
                if self.gr < 1:
                    iou = (1.0 - self.gr) + self.gr * iou
                tobj[b, a, gj, gi] = iou  # iou ratio
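The bounded ranges of this parameterization can be checked by feeding extreme raw outputs through the same two expressions. The input values are arbitrary.

```python
import torch

t = torch.tensor([-20.0, 0.0, 20.0])  # very negative, neutral and very positive raw outputs

xy_offset = t.sigmoid() * 2 - 0.5     # center offset within the cell
wh_scale = (t.sigmoid() * 2) ** 2     # multiplier applied to the anchor size
print(xy_offset, wh_scale)
```

The center offset stays within (-0.5, 1.5), so a cell can reach half a cell into its neighbours, and the size multiplier stays within (0, 4), which is consistent with the ratio limit of 4 used when matching anchors.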

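4.2.3 Box loss

The loss for bounding box regression has evolved in recent years as follows: Smooth L1 Loss -> IoU Loss (2016) -> GIoU Loss (2019) -> DIoU Loss (2020) -> CIoU Loss (2020). Yolov5 uses CIOU.

Here ρ is the Euclidean distance between the centers of the predicted box and the ground-truth box (d in the figure), c is the diagonal length of the smallest enclosing box that contains both the predicted box and the ground-truth box, v measures the consistency of the aspect ratios, and α is a positive trade-off parameter.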
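For reference, the CIoU loss these symbols belong to is usually written as follows. This is a reconstruction that agrees with the bbox_iou code in this section; b and b^{gt} denote the centers of the predicted and ground-truth boxes, and (w, h), (w^{gt}, h^{gt}) their sizes.

```latex
L_{CIoU} = 1 - IoU + \frac{\rho^2\!\left(b, b^{gt}\right)}{c^2} + \alpha v

v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2,
\qquad
\alpha = \frac{v}{(1 - IoU) + v}
```

Dropping the αv term gives the DIoU loss, and dropping the distance term as well gives the plain IoU loss.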

Computing IOU, CIoU, DIoU and GIoU in Yolov5

Paper: Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression

def bbox_iou(box1, box2, xywh=True, GIoU=False, DIoU=False, CIoU=False, eps=1e-7):
    # Returns Intersection over Union (IoU) of box1(1,4) to box2(n,4)

    # Get the coordinates of bounding boxes
    if xywh:  # transform from xywh to xyxy
        (x1, y1, w1, h1), (x2, y2, w2, h2) = box1.chunk(4, 1), box2.chunk(4, 1)
        w1_, h1_, w2_, h2_ = w1 / 2, h1 / 2, w2 / 2, h2 / 2
        b1_x1, b1_x2, b1_y1, b1_y2 = x1 - w1_, x1 + w1_, y1 - h1_, y1 + h1_
        b2_x1, b2_x2, b2_y1, b2_y2 = x2 - w2_, x2 + w2_, y2 - h2_, y2 + h2_
    else:  # x1, y1, x2, y2 = box1
        b1_x1, b1_y1, b1_x2, b1_y2 = box1.chunk(4, 1)
        b2_x1, b2_y1, b2_x2, b2_y2 = box2.chunk(4, 1)
        w1, h1 = b1_x2 - b1_x1, b1_y2 - b1_y1 + eps
        w2, h2 = b2_x2 - b2_x1, b2_y2 - b2_y1 + eps

    # Intersection area
    inter = (torch.min(b1_x2, b2_x2) - torch.max(b1_x1, b2_x1)).clamp(0) * \
            (torch.min(b1_y2, b2_y2) - torch.max(b1_y1, b2_y1)).clamp(0)

    # Union Area
    union = w1 * h1 + w2 * h2 - inter + eps

    # IoU
    iou = inter / union
    if CIoU or DIoU or GIoU:
        cw = torch.max(b1_x2, b2_x2) - torch.min(b1_x1, b2_x1)  # convex (smallest enclosing box) width
        ch = torch.max(b1_y2, b2_y2) - torch.min(b1_y1, b2_y1)  # convex height
        if CIoU or DIoU:  # Distance or Complete IoU https://arxiv.org/abs/1911.08287v1
            c2 = cw ** 2 + ch ** 2 + eps  # convex diagonal squared
            rho2 = ((b2_x1 + b2_x2 - b1_x1 - b1_x2) ** 2 + (b2_y1 + b2_y2 - b1_y1 - b1_y2) ** 2) / 4  # center dist ** 2
            if CIoU:  # https://github.com/Zzh-tju/DIoU-SSD-pytorch/blob/master/utils/box/box_utils.py#L47
                v = (4 / math.pi ** 2) * torch.pow(torch.atan(w2 / h2) - torch.atan(w1 / h1), 2)
                with torch.no_grad():
                    alpha = v / (v - iou + (1 + eps))
                return iou - (rho2 / c2 + v * alpha)  # CIoU
            return iou - rho2 / c2  # DIoU
        c_area = cw * ch + eps  # convex area
        return iou - (c_area - union) / c_area  # GIoU https://arxiv.org/pdf/1902.09630.pdf
    return iou  # IoU
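To get a feel for the numbers, the CIoU branch can be reproduced for a single pair of boxes in plain Python. The function ciou_xyxy is a simplified stand-in written for this example (corner format only, no batching), and the boxes are made up.

```python
import math


def ciou_xyxy(a, b, eps=1e-7):
    # IoU and CIoU between two boxes given as (x1, y1, x2, y2).
    w1, h1 = a[2] - a[0], a[3] - a[1]
    w2, h2 = b[2] - b[0], b[3] - b[1]
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union
    cw = max(a[2], b[2]) - min(a[0], b[0])  # width of the smallest enclosing box
    ch = max(a[3], b[3]) - min(a[1], b[1])  # height of the smallest enclosing box
    c2 = cw ** 2 + ch ** 2 + eps            # squared diagonal of the enclosing box
    rho2 = ((b[0] + b[2] - a[0] - a[2]) ** 2 + (b[1] + b[3] - a[1] - a[3]) ** 2) / 4  # squared center distance
    v = (4 / math.pi ** 2) * (math.atan(w2 / h2) - math.atan(w1 / h1)) ** 2           # aspect-ratio term
    alpha = v / (v - iou + (1 + eps))
    return iou, iou - (rho2 / c2 + v * alpha)


same = ciou_xyxy((0, 0, 2, 2), (0, 0, 2, 2))
apart = ciou_xyxy((0, 0, 2, 2), (4, 0, 6, 2))
print(same, apart)
```

Identical boxes give a CIoU of about 1. The two disjoint boxes both have an IoU of 0, yet the CIoU is about -0.4, so the loss still tells the optimizer how far apart the centers are.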
