百度網路硬碟論文鏈接,提取碼:kk89
https://pan.baidu.com/s/12RDu3WLgH5WcV_Mo3q02xg
或者去arxiv下載《Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks》.
翻譯論文只是爲了記錄一下自己的學習過程。作爲一名目標檢測初學者,專業和英文水平都有限,一些專業名詞和語法如果不準確或者有錯誤,還請大家見諒並指出。
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features——using the recently popular terminology of neural networks with 「attention」 mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.
最先進的目標檢測網路依賴於region proposal(候選區域)演算法來假設目標位置。像SPPnet[1]和Fast R-CNN[2]之類的進展已經減少了這些檢測網路的執行時間,使region proposal計算暴露爲瓶頸。在這項工作中,我們引入了一個區域生成網路(RPN),它與檢測網路共用整幅影象的折積特徵,從而實現幾乎無成本的region proposal。RPN是一種全折積網路,可以同時預測每個位置的目標邊界和目標得分。RPN經過端到端訓練,生成高品質的region proposal,再由Fast R-CNN用於檢測。我們進一步通過共用折積特徵將RPN和Fast R-CNN合併成一個單一的網路。借用最近流行的帶有「注意力」機制的神經網路術語,RPN元件告訴這個統一網路應該往哪裏看。對於非常深的VGG-16模型[3],我們的檢測系統在GPU上的幀速率爲5fps(包括所有步驟),同時每張影象僅使用300個proposals,就在PASCAL VOC 2007、2012和MS COCO數據集上達到了最先進的目標檢測精度。在ILSVRC和COCO 2015競賽中,Faster R-CNN和RPN是多個賽道第一名方案的基礎。程式碼已公開發佈。
Recent advances in object detection are driven by the success of region proposal methods (e.g., [4]) and region-based convolutional neural networks (R-CNNs) [5]. Although region-based CNNs were computationally expensive as originally developed in [5], their cost has been drastically reduced thanks to sharing convolutions across proposals [1], [2]. The latest incarnation, Fast R-CNN [2], achieves near real-time rates using very deep networks [3], when ignoring the time spent on region proposals. Now, proposals are the test-time computational bottleneck in state-of-the-art detection systems.
region proposal方法(例如[4])和基於區域的折積神經網路(R-CNN)[5]的成功推動了目標檢測的最新進展。儘管[5]中最初開發的基於區域的CNN計算成本很高,但由於在proposals之間共用折積計算[1],[2],其成本已經大幅降低。最新的改進版本Fast R-CNN[2],在忽略region proposal所花費時間的情況下,利用非常深的網路[3]實現了接近實時的速率。現在,proposals成爲最先進檢測系統在測試時的計算瓶頸。
Region proposal methods typically rely on inexpensive features and economical inference schemes. Selective Search [4], one of the most popular methods, greedily merges superpixels based on engineered low-level features. Yet when compared to efficient detection networks [2], Selective Search is an order of magnitude slower, at 2 seconds per image in a CPU implementation. EdgeBoxes [6] currently provides the best tradeoff between proposal quality and speed, at 0.2 seconds per image. Nevertheless, the region proposal step still consumes as much running time as the detection network.
region proposal方法通常依賴廉價的特徵和簡練的推斷方案。Selective Search[4]是最流行的方法之一,它基於人工設計的低層特徵貪婪地合併超畫素。然而與高效的檢測網路[2]相比,Selective Search慢了一個數量級,在CPU實現中每張影象需要2秒。EdgeBoxes[6]目前在proposal品質和速度之間提供了最佳的權衡,每張影象0.2秒。儘管如此,region proposal步驟仍然消耗與檢測網路同樣多的執行時間。
One may note that fast region-based CNNs take advantage of GPUs, while the region proposal methods used in research are implemented on the CPU, making such runtime comparisons inequitable. An obvious way to accelerate proposal computation is to re-implement it for the GPU. This may be an effective engineering solution, but re-implementation ignores the down-stream detection network and therefore misses important opportunities for sharing computation.
可能會注意到,基於區域的快速CNN充分利用了GPU的優勢,而研究中使用的region proposal方法則是在CPU上實現的,因此這種執行時比較是不公平的。加速region proposal計算的一個顯而易見的方法是將其在GPU上重新實現。這可能是一個有效的工程解決方案,但重新實現忽略了下遊檢測網路,因此會錯過共用計算的重要機會。
In this paper, we show that an algorithmic change——computing proposals with a deep convolutional neural network——leads to an elegant and effective solution where proposal computation is nearly cost-free given the detection network’s computation. To this end, we introduce novel Region Proposal Networks (RPNs) that share convolutional layers with state-of-the-art object detection networks [1], [2]. By sharing convolutions at test-time, the marginal cost for computing proposals is small (e.g., 10ms per image).
在本文中,我們展示了演算法的變化——用深度折積神經網路計算region proposal——獲得了一個優雅而有效的解決方案,其中在給定檢測網路計算的情況下region proposal計算接近零成本。爲此,我們引入了新的region proposal網路(RPN),它們共用最先進目標檢測網路的折積層。通過在測試時共用折積,計算region proposal的邊際成本很小(例如,每張影象10ms)。
Our observation is that the convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals. On top of these convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid. The RPN is thus a kind of fully convolutional network (FCN) [7] and can be trained end-to-end specifically for the task for generating detection proposals.
我們的觀察結果是基於區域的檢測器所使用的折積特徵對映,如Fast-RCNN,也可以用於生成region proposal。在這些折積特徵的基礎上,我們通過新增一些額外的折積層來構造一個RPN,這些折積層同時在規則網格上的每個位置迴歸區域邊界和目標得分。因此,RPN是一種完全折積網路(FCN),可以專門針對生成檢測建議的任務進行端到端的訓練。
RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. In contrast to prevalent methods [8], [9], [1], [2] that use pyramids of images (Figure 1, a) or pyramids of filters (Figure 1, b), we introduce novel 「anchor」 boxes that serve as references at multiple scales and aspect ratios. Our scheme can be thought of as a pyramid of regression references (Figure 1, c), which avoids enumerating images or filters of multiple scales or aspect ratios. This model performs well when trained and tested using single-scale images and thus benefits running speed.
RPN旨在有效預測具有廣泛尺度和長寬比的region proposal。與使用影象金字塔(圖1 a)或濾波器金字塔(圖1 b)的流行方法[8],[9],[1],[2]相比,我們引入新的「anchor」框作爲多種尺度和長寬比的參考。我們的方案可以被認爲是迴歸參考金字塔(圖1 c),它避免了遍歷多種比例或長寬比的影象或濾波器。這個模型在使用單尺度影象進行訓練和測試時執行良好,從而有利於提升執行速度。
圖1:解決多尺度和尺寸的不同方案。(a)構建影象和特徵對映金字塔,分類器以各種尺度執行。(b)在特徵對映上執行具有多個比例/大小的濾波器的金字塔。(c)我們在迴歸函數中使用參考邊界框金字塔。
To unify RPNs with Fast R-CNN [2] object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks.
爲了將RPN與Fast R-CNN目標檢測網路相結合,我們提出了一種訓練方案,在fine-tune region proposal任務和fine-tune目標檢測之間進行交替,同時保持region proposal的固定。該方案收斂速度快,產生一個統一的網路,折積特徵在兩個任務之間共用。
We comprehensively evaluate our method on the PASCAL VOC detection benchmarks [11] where RPNs with Fast R-CNNs produce detection accuracy better than the strong baseline of Selective Search with Fast R-CNNs. Meanwhile, our method waives nearly all computational burdens of Selective Search at test-time——the effective running time for proposals is just 10 milliseconds. Using the expensive very deep models of [3], our detection method still has a frame rate of 5fps (including all steps) on a GPU, and thus is a practical object detection system in terms of both speed and accuracy. We also report results on the MS COCO dataset [12] and investigate the improvements on PASCAL VOC using the COCO data. Code has been made publicly available at https://github.com/shaoqingren/faster_rcnn (in MATLAB) and https://github.com/rbgirshick/py-faster-rcnn (in Python).
我們在PASCAL VOC檢測基準數據集上綜合評估了我們的方法,其中具有Fast R-CNN的RPN產生的檢測精度優於使用Selective Search的Fast R-CNN的強基準模型。同時,我們的方法在測試時幾乎免除了Selective Search的所有計算負擔——region proposal的有效執行時間僅爲10毫秒。使用[3]的昂貴的非常深的模型,我們的檢測方法在GPU上仍然具有5fps的影格率(包括所有步驟),因此在速度和準確性方面是實用的目標檢測系統。我們還報告了在MS COCO數據集上[12]的結果,並使用COCO數據研究了在PASCAL VOC上的改進。程式碼可公開獲得https://github.com/shaoqingren/faster_rcnn(MATLAB實現)和https://github.com/rbgirshick/py-faster-rcnn(Python實現)。
A preliminary version of this manuscript was published previously [10]. Since then, the frameworks of RPN and Faster R-CNN have been adopted and generalized to other methods, such as 3D object detection [13], part-based detection [14], instance segmentation [15], and image captioning [16]. Our fast and effective object detection system has also been built in commercial systems such as at Pinterests [17], with user engagement improvements reported.
這篇稿件的初始版本是以前發表的[10]。從那時起,RPN和Faster R-CNN的框架已經被採用並推廣到其他方法,如3D目標檢測[13],基於部件的檢測[14],範例分割[15]和影象標題生成[16]。我們快速和有效的目標檢測系統也已經在Pinterest[17]的商業系統中進行了部署,並報告了使用者參與度的提高。
In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the basis of several 1st-place entries [18] in the tracks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. RPNs completely learn to propose regions from data, and thus can easily benefit from deeper and more expressive features (such as the 101-layer residual nets adopted in [18]). Faster R-CNN and RPN are also used by several other leading entries in these competitions. These results suggest that our method is not only a cost-efficient solution for practical usage, but also an effective way of improving object detection accuracy.
在ILSVRC和COCO 2015競賽中,Faster R-CNN和RPN是ImageNet檢測任務、ImageNet定位任務、COCO檢測任務和COCO分割任務中幾個第一名獲勝模型[18]的基礎。RPN完全從數據中學習propose regions,因此可以從更深入和更具表達性的特徵(例如[18]中採用的101層殘差網路)中輕鬆獲益。Faster R-CNN和RPN也被這些比賽中的其他幾個主要參賽者所使用。這些結果表明,我們的方法不僅是一個實用合算的解決方案,而且是一個提高目標檢測精度的有效方法。
Object Proposals. There is a large literature on object proposal methods. Comprehensive surveys and comparisons of object proposal methods can be found in [19], [20], [21]. Widely used object proposal methods include those based on grouping super-pixels (e.g., Selective Search [4], CPMC [22], MCG [23]) and those based on sliding windows (e.g., objectness in windows [24], EdgeBoxes [6]). Object proposal methods were adopted as external modules independent of the detectors (e.g., Selective Search [4] object detectors, R-CNN [5], and Fast R-CNN [2]).
目標Proposals. 有大量關於目標Proposals方法的文獻。目標Proposals方法的綜合調查和比較可以在[19],[20],[21]中找到。廣泛使用的目標Proposals方法包括基於超畫素分組(例如,Selective Search [4],CPMC[22],MCG[23])和那些基於滑動視窗的方法(例如視窗中的目標[24],EdgeBoxes[6])。目標Proposals方法被採用爲獨立於檢測器(例如,Selective Search [4]目標檢測器,R-CNN[5]和Fast R-CNN[2])的外部模組。
Deep Networks for Object Detection. The R-CNN method [5] trains CNNs end-to-end to classify the proposal regions into object categories or background. R-CNN mainly plays as a classifier, and it does not predict object bounds (except for refining by bounding box regression). Its accuracy depends on the performance of the region proposal module (see comparisons in [20]). Several papers have proposed ways of using deep networks for predicting object bounding boxes [25], [9], [26], [27]. In the OverFeat method [9], a fully-connected layer is trained to predict the box coordinates for the localization task that assumes a single object. The fully-connected layer is then turned into a convolutional layer for detecting multiple classspecific objects. The MultiBox methods [26], [27] generate region proposals from a network whose last fully-connected layer simultaneously predicts multiple class-agnostic boxes, generalizing the 「single-box」 fashion of OverFeat. These class-agnostic boxes are used as proposals for R-CNN [5]. The MultiBox proposal network is applied on a single image crop or multiple large image crops (e.g., 224×224), in contrast to our fully convolutional scheme. MultiBox does not share features between the proposal and detection networks. We discuss OverFeat and MultiBox in more depth later in context with our method. Concurrent with our work, the DeepMask method [28] is developed for learning segmentation proposals.
用於目標檢測的深度網路。R-CNN方法[5]對CNN進行端到端的訓練,將proposal regions分類爲目標類別或背景。R-CNN主要作爲分類器,並不能預測目標邊界(除了通過邊界框迴歸進行修正)。其準確度取決於region proposal模組的效能(參見[20]中的比較)。一些論文提出了使用深度網路來預測目標邊界框的方法[25],[9],[26],[27]。在OverFeat方法[9]中,訓練一個全連線層來預測邊界框座標,用於假定只有單個目標的定位任務。然後將全連線層變成折積層,用於檢測多個特定類別的目標。MultiBox方法[26],[27]從一個網路中生成region proposal,該網路最後的全連線層同時預測多個與類別無關的邊界框,推廣了OverFeat的「單邊界框」方式。這些與類別無關的邊界框被用作R-CNN[5]的proposals。與我們的全折積方案相比,MultiBox提議網路被應用於單張影象裁剪塊或多張大的影象裁剪塊(例如224×224)。MultiBox在提議網路和檢測網路之間不共用特徵。稍後在介紹我們的方法時會更深入地討論OverFeat和MultiBox。與我們的工作同時進行的DeepMask方法[28]是爲學習分割proposals而開發的。
Shared computation of convolutions [9], [1], [29], [7], [2] has been attracting increasing attention for efficient, yet accurate, visual recognition. The OverFeat paper [9] computes convolutional features from an image pyramid for classification, localization, and detection. Adaptively-sized pooling (SPP) [1] on shared convolutional feature maps is developed for efficient region-based object detection [1], [30] and semantic segmentation [29]. Fast R-CNN [2] enables end-to-end detector training on shared convolutional features and shows compelling accuracy and speed.
折積[9],[1],[29],[7],[2]的共用計算已經越來越受到人們的關注,因爲它可以有效而準確地進行視覺識別。OverFeat論文[9]計算影象金字塔的折積特徵用於分類、定位和檢測。共用折積特徵對映的自適應大小池化(SPP)[1]被開發用於有效的基於區域的目標檢測[1],[30]和語意分割[29]。Fast R-CNN[2]能夠對共用折積特徵進行端到端的檢測器訓練,並顯示出令人信服的準確性和速度。
Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions. The entire system is a single, unified network for object detection (Figure 2). Using the recently popular terminology of neural networks with attention [31] mechanisms, the RPN module tells the Fast R-CNN module where to look. In Section 3.1 we introduce the designs and properties of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with features shared.
我們的目標檢測系統,稱爲Faster R-CNN,由兩個模組組成。第一個模組是提議區域的深度全折積網路,第二個模組是使用這些提議區域的Fast R-CNN檢測器[2]。整個系統是一個單一的、統一的目標檢測網路(圖2)。使用最近流行的帶有「注意力」[31]機制的神經網路術語,RPN模組告訴Fast R-CNN模組在哪裏尋找。在第3.1節中,我們介紹了region proposal網路的設計和屬性。在第3.2節中,我們開發了用於訓練具有共用特徵的兩個模組的演算法。
圖2:Faster R-CNN是一個單一、統一的目標檢測網路。RPN模組作爲這個統一網路的「注意力」。
A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. We model this process with a fully convolutional network [7], which we describe in this section. Because our ultimate goal is to share computation with a Fast R-CNN object detection network [2], we assume that both nets share a common set of convolutional layers. In our experiments, we investigate the Zeiler and Fergus model [32] (ZF), which has 5 shareable convolutional layers, and the Simonyan and Zisserman model [3] (VGG-16), which has 13 shareable convolutional layers.
region proposal網路(RPN)以任意大小的影象作爲輸入,輸出一組矩形的目標proposals,每個proposals都有一個目標得分。我們用全折積網路[7]對這個過程進行建模,我們將在本節進行描述。因爲我們的最終目標是與Fast R-CNN目標檢測網路[2]共用計算,所以我們假設兩個網路共用一組共同的折積層。在我們的實驗中,我們研究了具有5個共用折積層的Zeiler和Fergus模型[32](ZF)和具有13個共用折積層的Simonyan和Zisserman模型[3](VGG-16)。
To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer. This small network takes as input an n × n spatial window of the input convolutional feature map. Each sliding window is mapped to a lower-dimensional feature (256-d for ZF and 512-d for VGG, with ReLU [33] following). This feature is fed into two sibling fully-connected layers——a box-regression layer (reg) and a box-classification layer (cls). We use n = 3 in this paper, noting that the effective receptive field on the input image is large (171 and 228 pixels for ZF and VGG, respectively). This mini-network is illustrated at a single position in Figure 3 (left). Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations. This architecture is naturally implemented with an n×n convolutional layer followed by two sibling 1 × 1 convolutional layers (for reg and cls, respectively).
爲了生成region proposal,我們在最後一個共用折積層輸出的折積特徵圖上滑動一個小網路。這個小網路以輸入折積特徵圖的n×n空間視窗作爲輸入。每個滑動視窗被對映到一個低維特徵(ZF爲256維,VGG爲512維,後面接ReLU[33])。這個特徵被輸入到兩個並列的全連線層:一個邊界框迴歸層(reg)和一個邊界框分類層(cls)。在本文中,我們使用n=3,注意輸入影象上的有效感受野是很大的(ZF和VGG分別爲171和228個畫素)。圖3(左)在單個位置展示了這個小網路。請注意,因爲小網路以滑動視窗方式執行,所以所有空間位置共用全連線層。這種架構自然地通過一個n×n折積層、後面接兩個並列的1×1折積層(分別用於reg和cls)來實現。
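(譯者注:下面用PyTorch風格的程式碼給出這個小網路(RPN head)結構的一個簡化示意,僅幫助理解「n×n折積 + 兩個並列1×1折積」的實現方式,並非原論文的Caffe實現;其中通道數512對應VGG-16(ZF爲256),k爲每個位置的anchor數,2k與4k的含義見下文。)

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """RPN head 的簡化示意:n×n 折積 + 兩個並列的 1×1 折積(cls 與 reg)。"""
    def __init__(self, in_channels=512, mid_channels=512, k=9):
        super().__init__()
        # n=3 的滑動視窗,等價於 3×3 折積(padding=1 保持空間大小不變)
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        # cls 層:每個位置輸出 2k 個得分(目標 / 非目標)
        self.cls = nn.Conv2d(mid_channels, 2 * k, kernel_size=1)
        # reg 層:每個位置輸出 4k 個邊界框座標參數
        self.reg = nn.Conv2d(mid_channels, 4 * k, kernel_size=1)

    def forward(self, feature_map):
        x = self.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)
```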
At each sliding-window location, we simultaneously predict multiple region proposals, where the number of maximum possible proposals for each location is denoted as k. So the reg layer has 4k outputs encoding the coordinates of k boxes, and the cls layer outputs 2k scores that estimate probability of object or not object for each proposal. The k proposals are parameterized relative to k reference boxes, which we call anchors. An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio (Figure 3, left). By default we use 3 scales and 3 aspect ratios, yielding k=9 anchors at each sliding position. For a convolutional feature map of a size W × H (typically ∼2,400), there are WHk anchors in total.
在每個滑動視窗位置,我們同時預測多個region proposal,其中每個位置可能proposal的最大數目表示爲k。因此,reg層具有4k個輸出,編碼k個邊界框的座標;cls層輸出2k個得分,分別估計每個proposal是目標或不是目標的概率。這k個proposals是相對於k個參考框進行參數化的,我們把這些參考框稱爲anchors(譯者注:也就是說進行了初始化,可能與ground truth有一定的差距,但可以通過反向傳播進行修正或調整)。每個anchor以所在的滑動視窗爲中心,並對應一種尺度和長寬比(圖3左)。預設情況下,我們使用3個尺度和3個長寬比,在每個滑動位置產生k=9個anchors。對於大小爲W×H(通常約爲2400)的折積特徵圖,總共有WHk個anchors。
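(譯者注:下面用numpy給出「在特徵圖每個位置生成k=9個anchors」的一個簡化示意,假設步長stride=16,尺度與長寬比取後文3.3節給出的預設值;僅供理解,並非原論文程式碼。)

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """在 feat_h×feat_w 的特徵圖上,爲每個位置生成 k=len(scales)*len(ratios) 個 anchors。
    回傳形狀爲 (feat_h*feat_w*k, 4) 的陣列,格式爲 (x1, y1, x2, y2)。"""
    base = []
    for scale in scales:          # anchor 面積約爲 scale^2(即 128^2、256^2、512^2)
        for ratio in ratios:      # 長寬比 1:1、1:2、2:1
            w = scale * np.sqrt(1.0 / ratio)
            h = scale * np.sqrt(ratio)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                            # (k, 4),以 (0, 0) 爲中心

    # 每個滑動視窗位置對應原圖上的一個中心點(間隔爲 stride)
    shift_x = (np.arange(feat_w) + 0.5) * stride
    shift_y = (np.arange(feat_h) + 0.5) * stride
    sx, sy = np.meshgrid(shift_x, shift_y)
    shifts = np.stack([sx.ravel(), sy.ravel(), sx.ravel(), sy.ravel()], axis=1)  # (H*W, 4)

    anchors = (shifts[:, None, :] + base[None, :, :]).reshape(-1, 4)
    return anchors

# 例如 1000×600 的輸入影象,特徵圖約爲 60×40,共 60*40*9 ≈ 20000 個 anchors
anchors = generate_anchors(40, 60)
print(anchors.shape)   # (21600, 4)
```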
圖3:左圖爲region proposal網路(RPN)。右圖爲PASCAL VOC 2007測試集上使用RPN提議的範例檢測。我們的方法可以檢測各種尺度和長寬比的目標。
Translation-Invariant Anchors
An important property of our approach is that it is translation invariant, both in terms of the anchors and the functions that compute proposals relative to the anchors. If one translates an object in an image, the proposal should translate and the same function should be able to predict the proposal in either location. This translation-invariant property is guaranteed by our method. As a comparison, the MultiBox method [27] uses k-means to generate 800 anchors, which are not translation invariant. So MultiBox does not guarantee that the same proposal is generated if an object is translated.
The translation-invariant property also reduces the model size. MultiBox has a (4+1)×800-dimensional fully-connected output layer, whereas our method has a (4+2)×9-dimensional convolutional output layer in the case of k=9 anchors. As a result, our output layer has 2.8×10^4 parameters (512×(4+2)×9 for VGG-16), two orders of magnitude fewer than MultiBox’s output layer that has 6.1×10^6 parameters (1536×(4+1)×800 for GoogleNet [34] in MultiBox [27]). If considering the feature projection layers, our proposal layers still have an order of magnitude fewer parameters than MultiBox. We expect our method to have less risk of overfitting on small datasets, like PASCAL VOC.
平移不變的Anchors
我們的方法的一個重要特性是它是平移不變的,無論是在anchors還是計算相對於anchors的region proposal的函數。如果在影象中平移目標,proposal應該平移,並且同樣的函數應該能夠在任一位置預測proposal。平移不變特性是由我們的方法保證的。作爲對比,MultiBox方法[27]使用k-means生成800個anchors,這不是平移不變的。所以如果平移目標,MultiBox不保證會生成相同的proposal。
平移不變特性也減小了模型的大小。MultiBox有(4+1)×800維的全連線輸出層,而我們的方法在k=9個anchors的情況下有(4+2)×9維的折積輸出層。因此,我們的輸出層具有2.8×10^4個參數(其中VGG-16爲512×(4+2)×9個),比MultiBox輸出層的6.1×10^6個參數少了兩個數量級(其中MultiBox[27]中的GoogleNet[34]爲1536×(4+1)×800個)。如果考慮到特徵投影層,我們的proposal層仍然比MultiBox少一個數量級的參數。我們期望我們的方法在PASCAL VOC等小數據集上有更小的過擬合風險。
Multi-Scale Anchors as Regression References
Our design of anchors presents a novel scheme for addressing multiple scales (and aspect ratios). As shown in Figure 1, there have been two popular ways for multi-scale predictions. The first way is based on image/feature pyramids, e.g., in DPM [8] and CNN-based methods [9], [1], [2]. The images are resized at multiple scales, and feature maps (HOG [8] or deep convolutional features [9], [1], [2]) are computed for each scale (Figure 1(a)). This way is often useful but is time-consuming. The second way is to use sliding windows of multiple scales (and/or aspect ratios) on the feature maps. For example, in DPM [8], models of different aspect ratios are trained separately using different filter sizes (such as 5×7 and 7×5). If this way is used to address multiple scales, it can be thought of as a 「pyramid of filters」 (Figure 1(b)). The second way is usually adopted jointly with the first way [8].
As a comparison, our anchor-based method is built on a pyramid of anchors, which is more cost-efficient. Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios. It only relies on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size. We show by experiments the effects of this scheme for addressing multiple scales and sizes (Table 8).
Because of this multi-scale design based on anchors, we can simply use the convolutional features computed on a single-scale image, as is also done by the Fast R-CNN detector [2]. The design of multi-scale anchors is a key component for sharing features without extra cost for addressing scales.
多尺度Anchors作爲迴歸參考
我們的anchors設計提出了一個新的方案來解決多尺度(和長寬比)。如圖1所示,多尺度預測有兩種流行的方法。第一種方法是基於影象/特徵金字塔,例如DPM[8]和基於CNN的方法[9],[1],[2]。影象在多個尺度上進行縮放,並且針對每個尺度(圖1(a))計算特徵對映(HOG[8]或深折積特徵[9],[1],[2])。這種方法通常是有用的,但是非常耗時。第二種方法是在特徵對映上使用多尺度(和/或長寬比)的滑動視窗。例如,在DPM[8]中,使用不同的濾波器大小(例如5×7和7×5)分別對不同長寬比的模型進行訓練。如果用這種方法來解決多尺度問題,可以把它看作是一個「濾波器金字塔」(圖1(b))。第二種方法通常與第一種方法聯合採用[8]。
作爲比較,我們基於anchor的方法建立在anchors金字塔上,這是更加高效的方法。我們的方法參照多尺度和長寬比的anchor框來分類和迴歸邊界框。它只依賴單一尺度的影象和特徵對映,並使用單一尺寸的濾波器(特徵對映上的滑動視窗)。我們通過實驗來展示這個方案解決多尺度和尺寸的效果(表8)。
由於這種基於anchors的多尺度設計,我們可以簡單地使用在單尺度影象上計算的折積特徵,Fast R-CNN檢測器也是這樣做的[2]。多尺度anchors設計是共用特徵的關鍵元件,不需要額外的成本來處理尺度。
表8:Faster R-CNN在PAS-CAL VOC 2007測試數據集上使用不同anchors設定的檢測結果。網路是VGG-16。訓練數據是VOC 2007訓練集。使用3個尺度和3個長寬比(69.9%)的預設設定,與表3中的相同。
For training RPNs, we assign a binary class label (of being an object or not) to each anchor. We assign a positive label to two kinds of anchors: (i) the anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors. Usually the second condition is sufficient to determine the positive samples; but we still adopt the first condition for the reason that in some rare cases the second condition may find no positive sample. We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the training objective.
爲了訓練RPN,我們爲每個anchor分配一個二值類別標籤(是目標或不是目標)。我們給以下兩種anchor分配正標籤:(i)與某個真實邊界框具有最高交並比(IoU)重疊的anchor;(ii)與任意真實邊界框的IoU重疊超過0.7的anchor。注意,單個真實邊界框可以爲多個anchor分配正標籤。通常第二個條件足以確定正樣本;但我們仍然採用第一個條件,因爲在一些極少數情況下,第二個條件可能找不到正樣本。如果一個非正樣本的anchor與所有真實邊界框的IoU比率都低於0.3,我們給它分配一個負標籤。既不是正樣本也不是負樣本的anchor對訓練目標函數沒有貢獻。
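(譯者注:下面用numpy給出這條正負樣本分配規則的一個簡化示意(1爲正樣本、0爲負樣本、-1表示不參與訓練),假設已經算好anchors與真實框之間的IoU矩陣,且影象中至少有一個真實框;僅供理解,並非原論文程式碼。)

```python
import numpy as np

def assign_anchor_labels(ious, pos_iou=0.7, neg_iou=0.3):
    """ious: (num_anchors, num_gt) 的 IoU 矩陣。
    回傳每個 anchor 的標籤:1 正樣本,0 負樣本,-1 忽略。"""
    labels = -np.ones(ious.shape[0], dtype=np.int64)

    max_iou_per_anchor = ious.max(axis=1)
    # (負樣本) 與所有真實框的 IoU 都低於 0.3
    labels[max_iou_per_anchor < neg_iou] = 0
    # (正樣本條件 ii) 與任一真實框的 IoU 超過 0.7
    labels[max_iou_per_anchor >= pos_iou] = 1
    # (正樣本條件 i) 對每個真實框,與其 IoU 最高的 anchor 也標爲正樣本
    best_anchor_per_gt = ious.argmax(axis=0)
    labels[best_anchor_per_gt] = 1
    return labels
```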
With these definitions, we minimize an objective function following the multi-task loss in Fast R-CNN [2]. Our loss function for an image is defined as:
根據這些定義,我們最小化遵循Fast R-CNN[2]中多工損失的目標函數。我們對一張影象的損失函數定義爲:
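(譯者注:依據下文對各符號的定義,方程(1)可以寫成如下的LaTeX形式,其中λ爲平衡兩項的權重:)

L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) \qquad (1)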
Here, i is the index of an anchor in a mini-batch and p_i is the predicted probability of anchor i being an object. The ground-truth label p_i^* is 1 if the anchor is positive, and is 0 if the anchor is negative. t_i is a vector representing the 4 parameterized coordinates of the predicted bounding box, and t_i^* is that of the ground-truth box associated with a positive anchor. The classification loss L_cls is log loss over two classes (object vs not object). For the regression loss, we use L_reg(t_i, t_i^*) = R(t_i - t_i^*) where R is the robust loss function (smooth L1) defined in [2]. The term p_i^* L_reg means the regression loss is activated only for positive anchors (p_i^* = 1) and is disabled otherwise (p_i^* = 0). The outputs of the cls and reg layers consist of {p_i} and {t_i} respectively.
其中,i是小批次數據中anchor的索引,p_i表示anchor i是目標的預測概率。如果anchor爲正樣本,則真實標籤p_i^*爲1;如果anchor爲負樣本,則爲0。t_i是表示預測邊界框的4個參數化座標組成的向量,而t_i^*是與正樣本anchor相關聯的真實邊界框的座標向量。分類損失L_cls是兩個類別上(是目標或不是目標)的對數損失。對於迴歸損失,我們使用L_reg(t_i, t_i^*)=R(t_i - t_i^*),其中R是在[2]中定義的魯棒損失函數(smooth L1)。p_i^* L_reg這一項表示迴歸損失僅對正樣本anchor(p_i^*=1)有效,否則無效(p_i^*=0)。cls和reg層的輸出分別由{p_i}和{t_i}組成。
The two terms are normalized by N_cls and N_reg and weighted by a balancing parameter λ. In our current implementation (as in the released code), the cls term in Eqn.(1) is normalized by the mini-batch size (i.e., N_cls = 256) and the reg term is normalized by the number of anchor locations (i.e., N_reg ∼ 2,400). By default we set λ = 10, and thus both cls and reg terms are roughly equally weighted. We show by experiments that the results are insensitive to the values of λ in a wide range (Table 9). We also note that the normalization as above is not required and could be simplified.
這兩個項用N_cls和N_reg進行歸一化,並由一個平衡參數λ加權。在我們目前的實現中(如在發佈的程式碼中),方程(1)中的cls項通過小批次數據的大小(即N_cls=256)進行歸一化,reg項根據anchor位置的數量(即N_reg~2400)進行歸一化。預設情況下,我們設定λ=10,因此cls和reg項的權重大致相等。我們通過實驗表明,結果在很寬的λ取值範圍內都不敏感(表9)。我們還注意到,上面的歸一化不是必需的,可以簡化。
表9:Faster R-CNN使用方程(1)中不同的λ值在PASCAL VOC 2007測試集上的檢測結果。網路是VGG-16。訓練數據是VOC 2007訓練集。使用λ= 10(69.9%)的預設設定與表3中的相同。
For bounding box regression, we adopt the parameterizations of the 4 coordinates following [5]:
where x, y, w, and h denote the box’s center coordinates and its width and height. Variables x, x_a, and x^* are for the predicted box, anchor box, and ground-truth box respectively (likewise for y, w, h). This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.
對於邊界框迴歸,我們採用[5]中的4個座標參數化:
其中,x、y、w、h表示邊界框的中心座標及其寬和高。變數x、x_a和x^*分別表示預測邊界框、anchor框和真實邊界框(y、w、h與其類似)。這可以被認爲是從anchor框到鄰近的真實邊界框的迴歸。
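(譯者注:按照[5]的定義,這4個座標的參數化可以寫作:)

t_x = (x - x_a)/w_a,\quad t_y = (y - y_a)/h_a,\quad t_w = \log(w/w_a),\quad t_h = \log(h/h_a)
t_x^* = (x^* - x_a)/w_a,\quad t_y^* = (y^* - y_a)/h_a,\quad t_w^* = \log(w^*/w_a),\quad t_h^* = \log(h^*/h_a)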
Nevertheless, our method achieves bounding-box regression by a different manner from previous RoI-based (Region of Interest) methods [1], [2]. In [1], [2], bounding-box regression is performed on features pooled from arbitrarily sized RoIs, and the regression weights are shared by all region sizes. In our formulation, the features used for regression are of the same spatial size (3 × 3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors are learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors.
然而,我們的方法通過與之前基於RoI(感興趣區域)的方法[1],[2]不同的方式來實現邊界框迴歸。在[1],[2]中,邊界框迴歸是在從任意大小的RoI池化得到的特徵上執行的,並且迴歸權重由所有區域大小共用。在我們的公式中,用於迴歸的特徵在特徵圖上具有相同的空間大小(3×3)。爲了處理不同的大小,我們學習一組k個邊界框迴歸器。每個迴歸器負責一種尺度和一種長寬比,而k個迴歸器不共用權重。因此,由於anchor的設計,即使特徵具有固定的尺度/比例,仍然可以預測各種尺寸的邊界框。
The RPN can be trained end-to-end by back-propagation and stochastic gradient descent (SGD) [35]. We follow the 「image-centric」 sampling strategy from [2] to train this network. Each mini-batch arises from a single image that contains many positive and negative example anchors. It is possible to optimize for the loss functions of all anchors, but this will bias towards negative samples as they are dominate. Instead, we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1. If there are fewer than 128 positive samples in an image, we pad the mini-batch with negative ones.
RPN可以通過反向傳播和隨機梯度下降(SGD)進行端對端訓練[35]。我們遵循[2]中的「影象中心」採樣策略來訓練這個網路。每個小批次數據都從包含許多正樣本和負樣本anchor的單張影象中產生。對所有anchor的損失函數進行優化是可能的,但是這樣會偏向於負樣本,因爲它們是佔大部分的。取而代之的是,我們在影象中隨機採樣256個anchor,計算一個小批次數據的損失函數,其中採樣的正anchor和負anchor的比率可達1:1。如果影象中的正樣本少於128個,我們使用負樣本填充小批次數據。
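(譯者注:下面給出「每張影象隨機採樣256個anchors、正負比例最高1:1、正樣本不足128時用負樣本補齊」這一策略的numpy簡化示意,假設負樣本總是足夠多;僅供理解,並非原論文程式碼。)

```python
import numpy as np

def sample_minibatch(labels, batch_size=256, pos_fraction=0.5):
    """labels: 每個 anchor 的標籤(1 正、0 負、-1 忽略)。
    回傳被選中參與損失計算的 anchor 索引。"""
    pos_idx = np.where(labels == 1)[0]
    neg_idx = np.where(labels == 0)[0]

    # 正樣本最多取 128 個(batch_size * pos_fraction)
    num_pos = min(len(pos_idx), int(batch_size * pos_fraction))
    if num_pos > 0:
        pos_idx = np.random.choice(pos_idx, num_pos, replace=False)

    # 不足的部分用負樣本補齊到 256(假設負樣本數量足夠)
    num_neg = min(batch_size - num_pos, len(neg_idx))
    neg_idx = np.random.choice(neg_idx, num_neg, replace=False)
    return np.concatenate([pos_idx[:num_pos], neg_idx])
```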
We randomly initialize all new layers by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. All other layers (i.e., the shared convolutional layers) are initialized by pre-training a model for ImageNet classification [36], as is standard practice [5]. We tune all layers of the ZF net, and conv3_1 and up for the VGG net to conserve memory [2]. We use a learning rate of 0.001 for 60k mini-batches, and 0.0001 for the next 20k mini-batches on the PASCAL VOC dataset. We use a momentum of 0.9 and a weight decay of 0.0005 [37]. Our implementation uses Caffe [38].
我們通過從標準差爲0.01的零均值高斯分佈中採樣權重來隨機初始化所有新層。所有其他層(即共用折積層)通過ImageNet分類[36]的預訓練模型來初始化,這是標準做法[5]。我們fine-tune ZF網路的所有層,而對VGG網路只fine-tune conv3_1及其之上的層以節省記憶體[2]。在PASCAL VOC數據集上,前60k個小批次數據我們使用0.001的學習率,接下來的20k個小批次數據使用0.0001。我們使用0.9的動量和0.0005的權重衰減[37]。我們的實現使用Caffe[38]。
Thus far we have described how to train a network for region proposal generation, without considering the region-based object detection CNN that will utilize these proposals. For the detection network, we adopt Fast R-CNN [2]. Next we describe algorithms that learn a unified network composed of RPN and Fast R-CNN with shared convolutional layers (Figure 2).
到目前爲止,我們已經描述瞭如何訓練用於生成region proposal的網路,沒有提及將如何利用這些proposals的基於區域的目標檢測CNN。對於檢測網路,我們採用Fast R-CNN[2]。接下來我們介紹一些演算法,學習由RPN和Fast R-CNN組成的具有共用折積層的統一網路(圖2)。
Both RPN and Fast R-CNN, trained independently, will modify their convolutional layers in different ways. We therefore need to develop a technique that allows for sharing convolutional layers between the two networks, rather than learning two separate networks. We discuss three ways for training networks with features shared:
獨立訓練的RPN和Fast R-CNN將以不同的方式修改折積層。因此,我們需要開發一種允許在兩個網路之間共用折積層的技術,而不是學習兩個獨立的網路。我們討論三個方法來訓練具有共用特徵的網路:
(i) Alternating training. In this solution, we first train RPN, and use the proposals to train Fast R-CNN. The network tuned by Fast R-CNN is then used to initialize RPN, and this process is iterated. This is the solution that is used in all experiments in this paper.
(i)交替訓練。在這個解決方案中,我們首先訓練RPN,並使用這些proposals來訓練Fast R-CNN。由Fast R-CNN fine-tune的網路然後被用於初始化RPN,並且重複迭代這個過程。這是本文所有實驗中使用的解決方案。
(ii) Approximate joint training. In this solution, the RPN and Fast R-CNN networks are merged into one network during training as in Figure 2. In each SGD iteration, the forward pass generates region proposals which are treated just like fixed, pre-computed proposals when training a Fast R-CNN detector. The backward propagation takes place as usual, where for the shared layers the backward propagated signals from both the RPN loss and the Fast R-CNN loss are combined. This solution is easy to implement. But this solution ignores the derivative w.r.t. the proposal boxes’ coordinates that are also network responses, so is approximate. In our experiments, we have empirically found this solver produces close results, yet reduces the training time by about 25-50% comparing with alternating training. This solver is included in our released Python code.
(ii)近似聯合訓練。在這個解決方案中,RPN和Fast R-CNN網路在訓練期間合併成一個網路,如圖2所示。在每次SGD迭代中,前向傳遞生成region proposal,在訓練Fast R-CNN檢測器時,這些proposal被當作固定的、預先計算好的proposal。反向傳播像往常一樣進行,其中對於共用層,來自RPN損失和Fast R-CNN損失的反向傳播信號被組合在一起。這個解決方案很容易實現。但是這個解決方案忽略了關於proposal邊界框座標(它們也是網路的輸出)的導數,因此是近似的。在我們的實驗中,我們通過實驗發現這個求解器產生了接近的結果,而與交替訓練相比,訓練時間減少了大約25-50%。這個求解器包含在我們發佈的Python程式碼中。
(iii) Non-approximate joint training. As discussed above, the bounding boxes predicted by RPN are also functions of the input. The RoI pooling layer [2] in Fast R-CNN accepts the convolutional features and also the predicted bounding boxes as input, so a theoretically valid backpropagation solver should also involve gradients w.r.t. the box coordinates. These gradients are ignored in the above approximate joint training. In a non-approximate joint training solution, we need an RoI pooling layer that is differentiable w.r.t. the box coordinates. This is a nontrivial problem and a solution can be given by an 「RoI warping」 layer as developed in [15], which is beyond the scope of this paper.
(iii)非近似的聯合訓練。如上所述,由RPN預測的邊界框也是輸入的函數。Fast R-CNN中的RoI池化層[2]接受折積特徵以及預測的邊界框作爲輸入,所以理論上有效的反向傳播求解器也應該包括關於邊界框座標的梯度。在上述近似聯合訓練中,這些梯度被忽略。在非近似的聯合訓練解決方案中,我們需要一個關於邊界框座標可微分的RoI池化層。這不是一個簡單的問題,[15]中提出的「RoI warping」層可以給出一種解決方案,但這超出了本文的範圍。
4-Step Alternating Training. In this paper, we adopt a pragmatic 4-step training algorithm to learn shared features via alternating optimization. In the first step, we train the RPN as described in Section 3.1.3. This network is initialized with an ImageNet-pre-trained model and fine-tuned end-to-end for the region proposal task. In the second step, we train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN. This detection network is also initialized by the ImageNet-pre-trained model. At this point the two networks do not share convolutional layers. In the third step, we use the detector network to initialize RPN training, but we fix the shared convolutional layers and only fine-tune the layers unique to RPN. Now the two networks share convolutional layers. Finally, keeping the shared convolutional layers fixed, we fine-tune the unique layers of Fast R-CNN. As such, both networks share the same convolutional layers and form a unified network. A similar alternating training can be run for more iterations, but we have observed negligible improvements.
四步交替訓練。在本文中,我們採用實用的四步訓練演算法,通過交替優化學習共用特徵。在第一步中,我們按照3.1.3節的描述訓練RPN。該網路使用ImageNet的預訓練模型進行初始化,並針對region proposal任務進行端到端的fine-tune。在第二步中,我們使用由第一步RPN生成的proposals,由Fast R-CNN訓練一個單獨的檢測網路。該檢測網路同樣由ImageNet的預訓練模型進行初始化。此時兩個網路還不共用折積層。在第三步中,我們使用檢測網路來初始化RPN的訓練,但是我們固定共用的折積層,只對RPN特有的層進行fine-tune。現在這兩個網路共用折積層了。最後,保持共用折積層固定,我們對Fast R-CNN特有的層進行fine-tune。這樣,兩個網路共用相同的折積層並形成統一的網路。類似的交替訓練可以執行更多次迭代,但是我們觀察到的改進可以忽略不計。
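(譯者注:四步交替訓練的流程可以用如下虛擬碼概括;train_rpn、train_fast_rcnn、generate_proposals等函數名均爲譯者假設,僅作流程示意,並非實際介面。)

```python
# 第一步:用 ImageNet 預訓練模型初始化,端到端 fine-tune RPN
rpn = train_rpn(init="imagenet")
proposals = rpn.generate_proposals(train_images)

# 第二步:用第一步的 proposals 訓練獨立的 Fast R-CNN(同樣以 ImageNet 模型初始化)
detector = train_fast_rcnn(init="imagenet", proposals=proposals)

# 第三步:用檢測網路的折積層初始化 RPN,固定共用折積層,只 fine-tune RPN 特有的層
rpn = train_rpn(init=detector.conv_layers, freeze_shared=True)
proposals = rpn.generate_proposals(train_images)

# 第四步:保持共用折積層固定,只 fine-tune Fast R-CNN 特有的層
detector = train_fast_rcnn(init=detector, proposals=proposals, freeze_shared=True)
```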
We train and test both region proposal and object detection networks on images of a single scale [1], [2]. We re-scale the images such that their shorter side is s = 600 pixels [2]. Multi-scale feature extraction (using an image pyramid) may improve accuracy but does not exhibit a good speed-accuracy trade-off [2]. On the re-scaled images, the total stride for both ZF and VGG nets on the last convolutional layer is 16 pixels, and thus is ∼10 pixels on a typical PASCAL image before resizing (∼500×375). Even such a large stride provides good results, though accuracy may be further improved with a smaller stride.
我們在單尺度影象上訓練和測試region proposal和目標檢測網路[1],[2]。我們對影象進行縮放,使得它們的短邊爲s=600畫素[2]。多尺度特徵提取(使用影象金字塔)可能會提高精度,但在速度與精度之間沒有表現出良好的權衡[2]。在縮放後的影象上,ZF和VGG網路在最後一個折積層上的總步長爲16個畫素,因此在縮放之前的典型PASCAL影象(約500×375)上約爲10個畫素。即使是這樣大的步長也能提供良好的效果,儘管使用更小的步長可能會進一步提高精度。
For anchors, we use 3 scales with box areas of 128^2, 256^2, and 512^2 pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1. These hyper-parameters are not carefully chosen for a particular dataset, and we provide ablation experiments on their effects in the next section. As discussed, our solution does not need an image pyramid or filter pyramid to predict regions of multiple scales, saving considerable running time. Figure 3 (right) shows the capability of our method for a wide range of scales and aspect ratios. Table 1 shows the learned average proposal size for each anchor using the ZF net. We note that our algorithm allows predictions that are larger than the underlying receptive field. Such predictions are not impossible—one may still roughly infer the extent of an object if only the middle of the object is visible.
對於anchor,我們使用了3個尺度,對應的邊界框面積分別爲128^2、256^2和512^2個畫素,以及1:1、1:2和2:1的長寬比。這些超參數不是針對特定數據集精心挑選的,我們將在下一節中提供有關其作用的消融實驗。如上所述,我們的解決方案不需要影象金字塔或濾波器金字塔來預測多個尺度的區域,節省了大量的執行時間。圖3(右)顯示了我們的方法在廣泛的尺度和長寬比方面的能力。表1顯示了使用ZF網路的每個anchor學習到的平均提議大小。我們注意到,我們的演算法允許預測比基礎感受野更大的結果。這樣的預測不是不可能的——如果只有目標的中間部分是可見的,那麼仍然可以粗略地推斷出目標的範圍。
表1:使用ZF網路的每個anchor學習到的平均提議大小(s=600的數位)
The anchor boxes that cross image boundaries need to be handled with care. During training, we ignore all cross-boundary anchors so they do not contribute to the loss. For a typical 1000 × 600 image, there will be roughly 20000 (≈ 60×40×9) anchors in total. With the cross-boundary anchors ignored, there are about 6000 anchors per image for training. If the boundary-crossing outliers are not ignored in training, they introduce large, difficult to correct error terms in the objective, and training does not converge. During testing, however, we still apply the fully convolutional RPN to the entire image. This may generate cross-boundary proposal boxes, which we clip to the image boundary.
跨越影象邊界的anchor框需要謹慎處理。在訓練過程中,我們忽略所有跨越邊界的anchor,因此它們不會對損失函數產生貢獻。對於一張典型的1000×600的影象,總共大約有20000(≈60×40×9)個anchor。忽略跨越邊界的anchor之後,每張影象約有6000個anchor用於訓練。如果在訓練中不忽略這些跨越邊界的異常值,它們會在目標函數中引入大的、難以糾正的誤差項,導致訓練不收斂。但在測試過程中,我們仍然將全折積RPN應用於整張影象。這可能會產生跨越影象邊界的proposal框,我們將它們裁剪到影象邊界。
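(譯者注:訓練時忽略跨邊界anchors、測試時把跨邊界proposals裁剪到影象邊界,可以用如下numpy示意實現;對1000×600的影象,約20000個anchors經過濾後剩下約6000個。僅供理解,並非原論文程式碼。)

```python
import numpy as np

def keep_inside_anchors(anchors, img_w, img_h):
    """訓練時只保留完全位於影象內部的 anchors;anchors 格式爲 (x1, y1, x2, y2)。"""
    inside = ((anchors[:, 0] >= 0) & (anchors[:, 1] >= 0) &
              (anchors[:, 2] <= img_w) & (anchors[:, 3] <= img_h))
    return np.where(inside)[0]

def clip_boxes(boxes, img_w, img_h):
    """測試時不做過濾,而是把跨越邊界的 proposal 框裁剪到影象邊界。"""
    boxes[:, 0::2] = np.clip(boxes[:, 0::2], 0, img_w)   # x1, x2
    boxes[:, 1::2] = np.clip(boxes[:, 1::2], 0, img_h)   # y1, y2
    return boxes
```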
Some RPN proposals highly overlap with each other. To reduce redundancy, we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores. We fix the IoU threshold for NMS at 0.7, which leaves us about 2000 proposal regions per image. As we will show, NMS does not harm the ultimate detection accuracy, but substantially reduces the number of proposals. After NMS, we use the top-N ranked proposal regions for detection. In the following, we train Fast R-CNN using 2000 RPN proposals, but evaluate different numbers of proposals at test-time.
一些RPN proposals互相之間高度重疊。爲了減少冗餘,我們在proposals區域根據他們的cls分類得分採取非極大值抑制(NMS)。我們將NMS的IoU閾值固定爲0.7,這就給每張影象留下了大約2000個proposal regions。正如我們將要展示的那樣,NMS不會損害最終的檢測準確性,但會大大減少proposal的數量。在NMS之後,我們使用前N個proposal regions來進行檢測。接下來,我們使用2000個RPN proposal對Fast R-CNN進行訓練,但在測試時評估不同數量的proposal。
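(譯者注:基於cls得分、IoU閾值爲0.7的非極大值抑制(NMS)可以用如下numpy示意實現,之後再取得分最高的前N個proposals送入檢測網路;僅供理解,並非原論文程式碼。)

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """boxes: (N, 4),scores: (N,)。回傳保留下來的 proposal 索引(按得分從高到低)。"""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # 計算當前得分最高的框與其餘框的 IoU
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # 只保留與當前框 IoU 不超過閾值的框,繼續下一輪
        order = order[1:][iou <= iou_thresh]
    return keep
```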
We comprehensively evaluate our method on the PASCAL VOC 2007 detection benchmark [11]. This dataset consists of about 5k trainval images and 5k test images over 20 object categories. We also provide results on the PASCAL VOC 2012 benchmark for a few models. For the ImageNet pre-trained network, we use the 「fast」 version of ZF net [32] that has 5 convolutional layers and 3 fully-connected layers, and the public VGG-16 model [3] that has 13 convolutional layers and 3 fully-connected layers. We primarily evaluate detection mean Average Precision (mAP), because this is the actual metric for object detection (rather than focusing on object proposal proxy metrics).
我們在PASCAL VOC 2007檢測基準數據集[11]上全面評估了我們的方法。這個數據集包含大約5000張訓練評估影象和在20個目標類別上的5000張測試影象。我們還提供了一些模型在PASCAL VOC 2012基準數據集上的測試結果。對於ImageNet預訓練網路,我們使用具有5個折積層和3個全連線層的ZF網路[32]的「快速」版本以及具有13個折積層和3個全連線層的公開的VGG-16模型[3]。我們主要評估檢測的平均精度均值(mAP),因爲這是檢測目標的實際指標(而不是關注目標proposal代理指標)。
Table 2 (top) shows Fast R-CNN results when trained and tested using various region proposal methods. These results use the ZF net. For Selective Search (SS) [4], we generate about 2000 proposals by the 「fast」 mode. For EdgeBoxes (EB) [6], we generate the proposals by the default EB setting tuned for 0.7 IoU. SS has an mAP of 58.7% and EB has an mAP of 58.6% under the Fast R-CNN framework. RPN with Fast R-CNN achieves competitive results, with an mAP of 59.9% while using up to 300 proposals. Using RPN yields a much faster detection system than using either SS or EB because of shared convolutional computations; the fewer proposals also reduce the region-wise fully-connected layers’ cost (Table 5).
表2(上面)顯示了使用各種region proposal方法進行訓練和測試的Fast R-CNN結果。這些結果使用ZF網路。對於Selective Search(SS)[4],我們通過「快速」模式生成約2000個proposals。對於EdgeBoxes(EB)[6],我們通過調整0.7 IoU的預設EB設定生成proposals。在Fast R-CNN框架下SS的mAP爲58.7%,EB的mAP爲58.6%。RPN與Fast R-CNN取得了有競爭力的結果,使用多達300個proposals,mAP爲59.9%。由於共用折積計算,使用RPN比使用SS或EB產生了更快的檢測系統;較少的proposals也減少了region方面的全連線層成本(表5)。
表2:PASCAL VOC 2007測試集上的檢測結果(在VOC 2007訓練評估集上進行了訓練)。檢測器是帶有ZF的Fast R-CNN,但使用各種不同proposal方法進行訓練和測試。
表5:K40 GPU上的時間(ms),其中SS proposal是在CPU上評估的。「區域級」(region-wise)計算包括NMS、池化、全連線和softmax層。執行時間的分析請參見我們發佈的程式碼。
Ablation Experiments on RPN. To investigate the behavior of RPNs as a proposal method, we conducted several ablation studies. First, we show the effect of sharing convolutional layers between the RPN and Fast R-CNN detection network. To do this, we stop after the second step in the 4-step training process. Using separate networks reduces the result slightly to 58.7% (RPN+ZF, unshared, Table 2). We observe that this is because in the third step when the detector-tuned features are used to fine-tune the RPN, the proposal quality is improved.
RPN上的消融實驗。爲了研究RPN作爲proposal方法的效能,我們進行了幾項消融研究。首先,我們顯示了RPN和Fast R-CNN檢測網路共用折積層的效果。爲此,我們在四步訓練過程的第二步之後停止訓練。使用單獨的網路將結果略微減少到58.7%(RPN+ZF,非共用,表2)。我們觀察到,這是因爲在第三步中,當使用檢測器調整的特徵來fine-tune RPN時,proposal品質得到了改善。
Next, we disentangle the RPN’s influence on training the Fast R-CNN detection network. For this purpose, we train a Fast R-CNN model by using the 2000 SS proposals and ZF net. We fix this detector and evaluate the detection mAP by changing the proposal regions used at test-time. In these ablation experiments, the RPN does not share features with the detector.
Replacing SS with 300 RPN proposals at test-time leads to an mAP of 56.8%. The loss in mAP is because of the inconsistency between the training/testing proposals. This result serves as the baseline for the following comparisons.
接下來,我們分析RPN對訓練Fast R-CNN檢測網路的影響。爲此,我們通過使用2000個SS proposals和ZF網路來訓練Fast R-CNN模型。我們固定這個檢測器,並通過改變測試時使用的proposal regions來評估檢測的mAP。在這些消融實驗中,RPN不與檢測器共用特徵。
在測試階段用300個RPN proposals替換SS proposals得到了56.8%的mAP。mAP的下降是因爲訓練/測試proposals的不一致。這個結果作爲以下比較的基準。
Somewhat surprisingly, the RPN still leads to a competitive result (55.1%) when using the top-ranked 100 proposals at test-time, indicating that the top-ranked RPN proposals are accurate. On the other extreme, using the top-ranked 6000 RPN proposals (without NMS) has a comparable mAP (55.2%), suggesting NMS does not harm the detection mAP and may reduce false alarms.
Next, we separately investigate the roles of RPN’s cls and reg outputs by turning off either of them at test-time. When the cls layer is removed at test-time (thus no NMS/ranking is used), we randomly sample N proposals from the unscored regions. The mAP is nearly unchanged with N=1000 (55.8%), but degrades considerably to 44.6% when N=100. This shows that the cls scores account for the accuracy of the highest ranked proposals.
有些令人驚訝的是,RPN在測試時使用排名最高的100個proposals仍然會獲得有競爭力的結果(55.1%),表明排名靠前的RPN proposals是準確的。相反的,使用排名靠前的6000個RPN proposals(沒有進行NMS)具有相當的mAP(55.2%),這表明NMS不會損害檢測mAP並可能減少誤報。
接下來,我們通過在測試時分別關閉RPN的cls和reg輸出來研究這兩者的作用。當cls層在測試時被移除(因此不使用NMS/排名),我們從未評分的區域中隨機採樣N個proposals。當N=1000時,mAP幾乎沒有變化(55.8%),但是當N=100時,會大幅降低到44.6%。這表明排名最高的proposals之所以準確,應歸功於cls得分。
On the other hand, when the reg layer is removed at test-time (so the proposals become anchor boxes), the mAP drops to 52.1%. This suggests that the high-quality proposals are mainly due to the regressed box bounds. The anchor boxes, though having multiple scales and aspect ratios, are not sufficient for accurate detection.
We also evaluate the effects of more powerful networks on the proposal quality of RPN alone. We use VGG-16 to train the RPN, and still use the above detector of SS+ZF. The mAP improves from 56.8% (using RPN+ZF) to 59.2% (using RPN+VGG). This is a promising result, because it suggests that the proposal quality of RPN+VGG is better than that of RPN+ZF. Because proposals of RPN+ZF are competitive with SS (both are 58.7% when consistently used for training and testing), we may expect RPN+VGG to be better than SS. The following experiments justify this hypothesis.
另一方面,當在測試階段移除reg層(所以proposals變成anchor框)時,mAP將下降到52.1%。這表明高品質的proposals主要是由於迴歸的邊界框。anchor框雖然具有多個尺度和長寬比,但不足以進行準確的檢測。
我們還單獨評估了更強大的網路對RPN proposal品質的影響。我們使用VGG-16來訓練RPN,仍然使用上述的SS+ZF檢測器。mAP從56.8%(使用RPN+ZF)提高到59.2%(使用RPN+VGG)。這是一個很有希望的結果,因爲這表明RPN+VGG的proposal品質要好於RPN+ZF。由於RPN+ZF的proposal與SS具有競爭性(當一致用於訓練和測試時,都是58.7%),所以我們可以預期RPN+VGG比SS更好。以下實驗驗證了這個假設。
Performance of VGG-16. Table 3 shows the results of VGG-16 for both proposal and detection. Using RPN+VGG, the result is 68.5% for unshared features, slightly higher than the SS baseline. As shown above, this is because the proposals generated by RPN+VGG are more accurate than SS. Unlike SS that is pre-defined, the RPN is actively trained and benefits from better networks. For the feature-shared variant, the result is 69.9%——better than the strong SS baseline, yet with nearly cost-free proposals. We further train the RPN and detection network on the union set of PASCAL VOC 2007 trainval and 2012 trainval. The mAP is 73.2%. Figure 5 shows some results on the PASCAL VOC 2007 test set. On the PASCAL VOC 2012 test set (Table 4), our method has an mAP of 70.4% trained on the union set of VOC 2007 trainval+test and VOC 2012 trainval. Table 6 and Table 7 show the detailed numbers.
VGG-16的效能。表3顯示了VGG-16的proposal和檢測結果。使用RPN+VGG,非共用特徵的結果是68.5%,略高於SS基準模型。如上所示,這是因爲RPN+VGG生成的proposal比SS更準確。與預先定義的SS不同,RPN是經過主動訓練的,並能從更好的網路中受益。對於特徵共用的變種,結果是69.9%,比強大的SS基準模型更好,而proposal幾乎是零成本的。我們進一步在PASCAL VOC 2007 trainval和2012 trainval的聯合集上訓練RPN和檢測網路,mAP是73.2%。圖5顯示了PASCAL VOC 2007測試集上的一些結果。在PASCAL VOC 2012測試集上(表4),我們的方法在VOC 2007 trainval+test和VOC 2012 trainval的聯合集上訓練的模型取得了70.4%的mAP。表6和表7給出了詳細的數值。
表3:PASCAL VOC 2007測試集的檢測結果。檢測器是Fast R-CNN和VGG-16。訓練數據:「07」代表VOC 2007 trainval,「07 + 12」代表VOC 2007 trainval和VOC 2012 trainval的聯合訓練集。對於RPN,訓練時Fast R-CNN的proposals數量爲2000。†:[2]中報道的數位;使用本文提供的倉庫程式碼,這個結果更高(68.1)。
表4:PASCAL VOC 2012測試集的檢測結果。檢測器是Fast R-CNN和VGG-16。訓練數據:「07」代表VOC 2007 trainval,「07 + 12」代表VOC 2007 trainval和VOC 2012 trainval的聯合訓練集。對於RPN,訓練時Fast R-CNN的提議數量爲2000。†:http://host.robots.ox.ac.uk:8080/anonymous/HZJTQA.html。‡:http://host.robots.ox.ac.uk:8080/anonymous/YNPLXB.html。§:http://host.robots.ox.ac.uk:8080/anonymous/XEDH10.html。
表6:使用Fast R-CNN檢測器和VGG-16在PASCAL VOC 2007測試集上的結果。對於RPN,訓練時Fast R-CNN的提議數量爲2000。RPN*表示沒有共用特徵的版本。
表7:使用Fast R-CNN檢測器和VGG-16在PASCAL VOC 2012測試集上的結果。對於RPN,訓練時Fast R-CNN的提議數量爲2000。
圖5:使用Faster R-CNN系統在PASCAL VOC 2007測試集上目標檢測結果的幾個範例。該模型是VGG-16,訓練數據是07+12 trainval(2007年測試集中73.2%的mAP)。我們的方法檢測廣泛的尺度和長寬比目標。每個輸出框都與類別標籤和[0,1]之間的softmax分數相關聯。使用0.6的分數閾值來顯示這些影象。獲得這些結果包括所有步驟的執行時間爲每張影象198ms。
In Table 5 we summarize the running time of the entire object detection system. SS takes 1-2 seconds depending on content (on average about 1.5s), and Fast R-CNN with VGG-16 takes 320ms on 2000 SS proposals (or 223ms if using SVD on fully-connected layers [2]). Our system with VGG-16 takes in total 198ms for both proposal and detection. With the convolutional features shared, the RPN alone only takes 10ms computing the additional layers. Our region-wise computation is also lower, thanks to fewer proposals (300 per image). Our system has a frame-rate of 17 fps with the ZF net.
在表5中,我們總結了整個目標檢測系統的執行時間。SS需要1-2秒,具體取決於影象內容(平均大約1.5s),而使用VGG-16的Fast R-CNN在2000個SS proposals上需要320ms(如果在全連線層上使用SVD[2],則需要223ms)。我們使用VGG-16的系統在proposal和檢測上總共需要198ms。在共用折積特徵的情況下,單獨的RPN只需要10ms來計算附加的層。由於proposals較少(每張圖片300個),我們的區域級計算量也更低。我們的系統在使用ZF網路時幀速率爲17fps。
Sensitivities to Hyper-parameters. In Table 8 we investigate the settings of anchors. By default we use 3 scales and 3 aspect ratios (69.9% mAP in Table 8). If using just one anchor at each position, the mAP drops by a considerable margin of 3-4%. The mAP is higher if using 3 scales (with 1 aspect ratio) or 3 aspect ratios (with 1 scale), demonstrating that using anchors of multiple sizes as the regression references is an effective solution. Using just 3 scales with 1 aspect ratio (69.8%) is as good as using 3 scales with 3 aspect ratios on this dataset, suggesting that scales and aspect ratios are not disentangled dimensions for the detection accuracy. But we still adopt these two dimensions in our designs to keep our system flexible.
對超參數的敏感度。在表8中,我們研究了anchor的設定。預設情況下,我們使用3個尺度和3個長寬比(表8中69.9%的mAP)。如果在每個位置只使用一個anchor,mAP會下降3-4%,幅度相當可觀。如果使用3個尺度(1個長寬比)或3個長寬比(1個尺度),mAP會更高,這表明使用多種尺寸的anchor作爲迴歸參考是有效的解決方案。在這個數據集上,僅使用3個尺度和1個長寬比(69.8%)與使用3個尺度和3個長寬比一樣好,這表明尺度和長寬比對於檢測準確度而言並不是相互獨立(解耦)的維度。但我們仍然在設計中採用這兩個維度,以保持系統的靈活性。
In Table 9 we compare different values of λ in Equation (1). By default we use λ=10 which makes the two terms in Equation (1) roughly equally weighted after normalization. Table 9 shows that our result is impacted just marginally (by ~1%) when λ is within a scale of about two orders of magnitude (1 to 100). This demonstrates that the result is insensitive to λ in a wide range.
在表9中,我們比較了公式(1)中λ的不同值。預設情況下,我們使用λ=10,這使方程(1)中的兩個項在歸一化之後大致相等地加權。表9顯示,當λ在大約兩個數量級(1到100)的範圍內時,我們的結果只是稍微受到影響(~1%)。這表明結果對寬範圍內的λ不敏感。
Analysis of Recall-to-IoU. Next we compute the recall of proposals at different IoU ratios with ground-truth boxes. It is noteworthy that the Recall-to-IoU metric is just loosely [19], [20], [21] related to the ultimate detection accuracy. It is more appropriate to use this metric to diagnose the proposal method than to evaluate it.
召回率與IoU的分析。接下來,我們計算proposals與真實邊界框在不同IoU比率下的召回率。值得注意的是,Recall-to-IoU度量與最終的檢測精度只是鬆散相關[19],[20],[21]。使用這個指標來診斷proposal方法比用它來評估proposal方法更合適。
In Figure 4, we show the results of using 300, 1000, and 2000 proposals. We compare with SS and EB, and the N proposals are the top-N ranked ones based on the confidence generated by these methods. The plots show that the RPN method behaves gracefully when the number of proposals drops from 2000 to 300. This explains why the RPN has a good ultimate detection mAP when using as few as 300 proposals. As we analyzed before, this property is mainly attributed to the cls term of the RPN. The recall of SS and EB drops more quickly than RPN when the proposals are fewer.
在圖4中,我們顯示了使用300、1000和2000個proposals的結果。我們與SS和EB進行比較,這N個proposals是根據這些方法產生的置信度選出的排名前N個。從圖中可以看出,當proposals數量從2000個減少到300個時,RPN方法的召回率下降得很平緩。這就解釋了爲什麼RPN在僅使用300個proposals時仍具有良好的最終檢測mAP。正如我們之前分析過的,這個屬性主要歸因於RPN的cls項。當proposals較少時,SS和EB的召回率下降得比RPN更快。
圖4:PASCAL VOC 2007測試集上的召回率和IoU重疊率。
One-Stage Detection vs. Two-Stage Proposal + Detection. The OverFeat paper [9] proposes a detection method that uses regressors and classifiers on sliding windows over convolutional feature maps. OverFeat is a one-stage, class-specific detection pipeline, and ours is a two-stage cascade consisting of class-agnostic proposals and class-specific detections. In OverFeat, the region-wise features come from a sliding window of one aspect ratio over a scale pyramid. These features are used to simultaneously determine the location and category of objects. In RPN, the features are from square (3× 3) sliding windows and predict proposals relative to anchors with different scales and aspect ratios. Though both methods use sliding windows, the region proposal task is only the first stage of Faster R-CNN —— the downstream Fast R-CNN detector attends to the proposals to refine them. In the second stage of our cascade, the region-wise features are adaptively pooled [1], [2] from proposal boxes that more faithfully cover the features of the regions. We believe these features lead to more accurate detections.
一階段檢測與兩階段proposal+檢測的對比。OverFeat論文[9]提出了一種在折積特徵圖的滑動視窗上使用迴歸器和分類器的檢測方法。OverFeat是一個一階段的、類別特定的檢測流程,而我們的是兩階段級聯,由類別不可知的proposals和類別特定的檢測組成。在OverFeat中,區域級特徵來自尺度金字塔上單一長寬比的滑動視窗。這些特徵用於同時確定目標的位置和類別。在RPN中,特徵來自正方形(3×3)的滑動視窗,並相對於具有不同尺度和長寬比的anchor來預測proposals。雖然這兩種方法都使用滑動視窗,但region proposal任務只是Faster R-CNN的第一階段,下遊的Fast R-CNN檢測器會關注這些proposals並對其進行細化。在我們級聯的第二階段,區域級特徵是從能更忠實地覆蓋區域特徵的proposal框中自適應地池化[1],[2]得到的。我們相信這些特徵會帶來更準確的檢測結果。
To compare the one-stage and two-stage systems, we emulate the OverFeat system (and thus also circumvent other differences of implementation details) by one-stage Fast R-CNN. In this system, the 「proposals」 are dense sliding windows of 3 scales (128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1). Fast R-CNN is trained to predict class-specific scores and regress box locations from these sliding windows. Because the OverFeat system adopts an image pyramid, we also evaluate using convolutional features extracted from 5 scales. We use those 5 scales as in [1], [2].
爲了比較一階段和兩階段系統,我們通過一階段Fast R-CNN來模擬OverFeat系統(從而也規避了實現細節的其他差異)。在這個系統中,「proposals」是3個尺度(128、256、512)和3個長寬比(1:1,1:2,2:1)的密集滑動視窗。訓練Fast R-CNN來預測類別特定的分數,並從這些滑動視窗中迴歸邊界框位置。由於OverFeat系統採用影象金字塔,我們也使用從5個尺度中提取的折積特徵進行評估。我們使用[1],[2]中5個尺度。
Table 10 compares the two-stage system and two variants of the one-stage system. Using the ZF model, the one-stage system has an mAP of 53.9%. This is lower than the two-stage system (58.7%) by 4.8%. This experiment justifies the effectiveness of cascaded region proposals and object detection. Similar observations are reported in [2], [39], where replacing SS region proposals with sliding windows leads to ~6% degradation in both papers. We also note that the one-stage system is slower as it has considerably more proposals to process.
表10比較了兩階段系統和一階段系統的兩個變種。使用ZF模型,一階段系統具有53.9%的mAP。這比兩階段系統(58.7%)低4.8%。這個實驗驗證了級聯region proposal和目標檢測的有效性。在文獻[2],[39]中報道了類似的觀察結果,在這兩篇論文中,用滑動窗取代SS region proposal會導致約6%的下降。我們也注意到,一階段系統更慢,因爲它產生了更多的proposals。
表10:一階段檢測與兩階段proposal+檢測的對比。使用ZF模型和Fast R-CNN在PASCAL VOC 2007測試集上的檢測結果。RPN使用非共用的特徵。
We present more results on the Microsoft COCO object detection dataset [12]. This dataset involves 80 object categories. We experiment with the 80k images on the training set, 40k images on the validation set, and 20k images on the test-dev set. We evaluate the mAP averaged for IoU∈[0.5:0.05:0.95] (COCO’s standard metric, simply denoted as mAP@[.5, .95]) and [email protected] (PASCAL VOC’s metric).
我們在Microsoft COCO目標檢測數據集[12]上給出了更多的結果。這個數據集包含80個目標類別。我們用訓練集上的8萬張影象、驗證集上的4萬張影象以及測試開發(test-dev)集上的2萬張影象進行實驗。我們評估了在IoU∈[0.5:0.05:0.95]上平均的mAP(COCO的標準度量,簡記爲mAP@[.5, .95])和[email protected](PASCAL VOC的度量)。
There are a few minor changes of our system made for this dataset. We train our models on an 8-GPU implementation, and the effective mini-batch size becomes 8 for RPN (1 per GPU) and 16 for Fast R-CNN (2 per GPU). The RPN step and Fast R-CNN step are both trained for 240k iterations with a learning rate of 0.003 and then for 80k iterations with 0.0003. We modify the learning rates (starting with 0.003 instead of 0.001) because the mini-batch size is changed. For the anchors, we use 3 aspect ratios and 4 scales (adding 64^2), mainly motivated by handling small objects on this dataset. In addition, in our Fast R-CNN step, the negative samples are defined as those with a maximum IoU with ground truth in the interval of [0,0.5), instead of [0.1,0.5) used in [1], [2]. We note that in the SPPnet system [1], the negative samples in [0.1, 0.5) are used for network fine-tuning, but the negative samples in [0, 0.5) are still visited in the SVM step with hard-negative mining. But the Fast R-CNN system [2] abandons the SVM step, so the negative samples in [0,0.1) are never visited. Including these [0,0.1) samples improves [email protected] on the COCO dataset for both Fast R-CNN and Faster R-CNN systems (but the impact is negligible on PASCAL VOC).
我們的系統對這個數據集做了一些小的改動。我們在8個GPU的實現上訓練模型,RPN的有效小批次大小爲8(每個GPU 1張),Fast R-CNN爲16(每個GPU 2張)。RPN和Fast R-CNN這兩個步驟都先以0.003的學習率訓練24萬次迭代,然後以0.0003的學習率訓練8萬次迭代。由於小批次數據的大小發生了變化,我們修改了學習率(從0.003而不是0.001開始)。對於anchor,我們使用3個長寬比和4個尺度(增加了64^2),這主要是爲了處理這個數據集上的小目標。此外,在我們的Fast R-CNN步驟中,負樣本定義爲與真實邊界框的最大IoU在[0,0.5)區間內的樣本,而不是[1],[2]中使用的[0.1,0.5)。我們注意到,在SPPnet系統[1]中,[0.1,0.5)中的負樣本用於網路fine-tune,但[0,0.5)中的負樣本在帶有難例挖掘的SVM步驟中仍然會被用到。但是Fast R-CNN系統[2]放棄了SVM步驟,所以[0,0.1)中的負樣本永遠不會被用到。包含這些[0,0.1)的樣本改進了Fast R-CNN和Faster R-CNN系統在COCO數據集上的[email protected](但對PASCAL VOC的影響可以忽略不計)。
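(譯者注:本段描述的COCO實驗設定可以概括爲如下的組態示意;字典中的鍵名爲譯者假設,僅用於整理正文給出的數值。)

```python
coco_settings = {
    "num_gpus": 8,
    "rpn_batch_size": 8,          # 每個 GPU 1 張影象
    "fast_rcnn_batch_size": 16,   # 每個 GPU 2 張影象
    "lr_schedule": [(240000, 0.003), (80000, 0.0003)],  # (迭代次數, 學習率)
    "anchor_scales": (64, 128, 256, 512),   # 增加 64^2 的尺度以處理小目標
    "anchor_ratios": (0.5, 1.0, 2.0),
    "fast_rcnn_negative_iou_range": (0.0, 0.5),  # 取代 [0.1, 0.5)
}
```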
The rest of the implementation details are the same as on PASCAL VOC. In particular, we keep using 300 proposals and single-scale (s=600) testing. The testing time is still about 200ms per image on the COCO dataset.
In Table 11 we first report the results of the Fast R-CNN system [2] using the implementation in this paper. Our Fast R-CNN baseline has 39.3% [email protected] on the test-dev set, higher than that reported in [2]. We conjecture that the reason for this gap is mainly due to the definition of the negative samples and also the changes of the mini-batch sizes. We also note that the mAP@[.5, .95] is just comparable.
其餘的實現細節與PASCAL VOC相同。特別的是,我們繼續使用300個proposals和單一尺度(s=600)測試。COCO數據集上的測試時間仍然是大約200ms處理一張影象。
在表11中,我們首先報告了使用本文實現的Fast R-CNN系統[2]的結果。我們的Fast R-CNN基準模型在test-dev數據集上有39.3%的[email protected],比[2]中報告的更高。我們推測造成這種差距的原因主要是負樣本的定義以及小批次大小的變化。我們也注意到mAP@[.5, .95]只是大致相當。
表11:在MS COCO數據集上的目標檢測結果(%)。模型是VGG-16。
Next we evaluate our Faster R-CNN system. Using the COCO training set to train, Faster R-CNN has 42.1% [email protected] and 21.5% mAP@[.5, .95] on the COCO test-dev set. This is 2.8% higher for [email protected] and 2.2% higher for mAP@[.5, .95] than the Fast R-CNN counterpart under the same protocol (Table 11). This indicates that RPN performs excellent for improving the localization accuracy at higher IoU thresholds. Using the COCO trainval set to train, Faster R-CNN has 42.7% [email protected] and 21.9% mAP@[.5, .95] on the COCO test-dev set. Figure 6 shows some results on the MS COCO test-dev set.
接下來我們評估我們的Faster R-CNN系統。使用COCO訓練集訓練,Faster R-CNN在COCO測試開發集上有42.1%的[email protected]和21.5%的mAP@[.5, .95]。在相同設定下,與Fast R-CNN相比,[email protected]高2.8%,mAP@[.5, .95]高2.2%(表11)。這表明,RPN在更高的IoU閾值上對提高定位精度表現出色。使用COCO trainval集訓練,Faster R-CNN在COCO測試開發集上有42.7%的[email protected]和21.9%的mAP@[.5, .95]。圖6顯示了MS COCO測試開發集上的一些結果。
Figure 6: Selected examples of object detection results on the MS COCO test-dev set using the Faster R-CNN system. The model is VGG-16 and the training data is COCO trainval (42.7% [email protected] on the test-dev set).Each output box is associated with a category label and a softmax score in [0,1]. A score threshold of 0.6 is used to display these images. For each image, one color represents one object category in that image.
圖6:使用Faster R-CNN系統在MS COCO test-dev數據集上目標檢測結果的一些範例。模型是VGG-16,訓練數據是COCO trainval(在test-dev數據集上爲42.7%的mAP@0.5)。每個輸出框都對應一個類別標籤和一個[0, 1]之間的softmax分數。顯示這些影象時使用0.6的分數閾值。對於每張影象,一種顏色代表該影象中的一個目標類別。
Faster R-CNN in ILSVRC & COCO 2015 competitions. We have demonstrated that Faster R-CNN benefits more from better features, thanks to the fact that the RPN completely learns to propose regions by neural networks. This observation is still valid even when one increases the depth substantially to over 100 layers [18]. Only by replacing VGG-16 with a 101-layer residual net (ResNet-101) [18], the Faster R-CNN system increases the mAP from 41.5%/21.2% (VGG-16) to 48.4%/27.2% (ResNet-101) on the COCO val set. With other improvements orthogonal to Faster R-CNN, He et al. [18] obtained a single-model result of 55.7%/34.9% and an ensemble result of 59.0%/37.4% on the COCO test-dev set, which won the 1st place in the COCO 2015 object detection competition. The same system [18] also won the 1st place in the ILSVRC 2015 object detection competition, surpassing the second place by absolute 8.5%. RPN is also a building block of the 1st-place winning entries in ILSVRC 2015 localization and COCO 2015 segmentation competitions, for which the details are available in [18] and [15] respectively.
在ILSVRC和COCO 2015競賽中的Faster R-CNN。我們已經證明,由於RPN完全通過神經網路來學習生成候選區域,Faster R-CNN能從更好的特徵中獲得更多收益。即使將網路深度大幅增加到100層以上,這一觀察仍然成立[18]。僅僅用101層殘差網路(ResNet-101)[18]替換VGG-16,Faster R-CNN系統在COCO驗證集上的mAP就從41.5%/21.2%(VGG-16)提升到48.4%/27.2%(ResNet-101)。結合其他與Faster R-CNN正交的改進,何愷明等人[18]在COCO test-dev集上獲得了55.7%/34.9%的單模型結果和59.0%/37.4%的模型整合結果,在COCO 2015目標檢測競賽中獲得第一名。同樣的系統[18]也在ILSVRC 2015目標檢測競賽中獲得第一名,以8.5%的絕對優勢超過第二名。RPN也是ILSVRC 2015定位和COCO 2015分割競賽第一名獲勝方案的基石,詳情請分別參見[18]和[15]。
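作爲補充,下面用torchvision(假設使用較新版本)給出一段示意程式碼,說明「把VGG-16換成ResNet-101主幹」這種做法;這只是概念性的草圖,並不是[18]中含有其他改進(如FPN等)的原始實現:

```python
import torch
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# 取 ResNet-101 去掉 avgpool 和 fc,只保留特徵萃取部分作爲主幹
resnet = torchvision.models.resnet101(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])
backbone.out_channels = 2048  # FasterRCNN 需要知道主幹輸出的通道數

# 單一特徵圖:4 個尺度 x 3 個長寬比的 anchor,RoI 池化輸出 7x7
anchor_generator = AnchorGenerator(sizes=((64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)

model = FasterRCNN(backbone, num_classes=81,  # COCO:80 類 + 背景
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)

model.eval()
with torch.no_grad():
    outputs = model([torch.rand(3, 600, 800)])  # 每張影象回傳 boxes / labels / scores
print(outputs[0].keys())
```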
Large-scale data is of crucial importance for improving deep neural networks. Next, we investigate how the MS COCO dataset can help with the detection performance on PASCAL VOC.
As a simple baseline, we directly evaluate the COCO detection model on the PASCAL VOC dataset, without fine-tuning on any PASCAL VOC data. This evaluation is possible because the categories on COCO are a superset of those on PASCAL VOC. The categories that are exclusive on COCO are ignored in this experiment, and the softmax layer is performed only on the 20 categories plus background. The mAP under this setting is 76.1% on the PASCAL VOC 2007 test set (Table 12). This result is better than that trained on VOC07+12 (73.2%) by a good margin, even though the PASCAL VOC data are not exploited.
大規模數據對改善深度神經網路至關重要。接下來,我們研究了MS COCO數據集如何幫助改進在PASCAL VOC上的檢測效能。
作爲一個簡單的基準,我們直接在PASCAL VOC數據集上評估COCO檢測模型,而不在任何PASCAL VOC數據上進行fine-tune。這種評估是可行的,因爲COCO的類別是PASCAL VOC類別的超集。在這個實驗中忽略COCO專有的類別,softmax層僅在20個類別加背景上執行。在這種設定下,PASCAL VOC 2007測試集上的mAP爲76.1%(表12)。即使完全沒有利用PASCAL VOC的數據,這個結果也明顯好於在VOC07+12上訓練的模型(73.2%)。
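下面是一段假設性的示意程式碼(voc_to_coco_index對照表爲自擬),說明「只在20個VOC類別加背景上執行softmax、忽略COCO專有類別」這一步的做法:

```python
import numpy as np

def restrict_softmax_to_voc(coco_logits, voc_to_coco_index, background_index=0):
    """coco_logits:形狀 (num_rois, 81) 的分類分數(80 個 COCO 類別 + 背景)。
    voc_to_coco_index:VOC 類別名 -> 該類別在 COCO 模型輸出中的索引(自擬對照表)。
    回傳只含背景與對照表中各 VOC 類別的機率(完整對照時形狀爲 (num_rois, 21))。"""
    keep = [background_index] + list(voc_to_coco_index.values())
    logits = coco_logits[:, keep]                        # 忽略 COCO 專有類別
    logits = logits - logits.max(axis=1, keepdims=True)  # 數值穩定
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# 用法範例(只示意兩個類別的對照)
probs = restrict_softmax_to_voc(np.random.randn(5, 81), {"person": 1, "dog": 18})
print(probs.shape)  # (5, 3):背景 + 2 個類別
```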
Then we fine-tune the COCO detection model on the VOC dataset. In this experiment, the COCO model is in place of the ImageNet-pre-trained model (that is used to initialize the network weights), and the Faster R-CNN system is fine-tuned as described in Section 3.2. Doing so leads to 78.8% mAP on the PASCAL VOC 2007 test set. The extra data from the COCO set increases the mAP by 5.6%. Table 6 shows that the model trained on COCO+VOC has the best AP for every individual category on PASCAL VOC 2007. Similar improvements are observed on the PASCAL VOC 2012 test set (Table 12 and Table 7). We note that the test-time speed of obtaining these strong results is still about 200ms per image.
然後我們在VOC數據集上對COCO檢測模型進行fine-tune。在這個實驗中,COCO模型取代了ImageNet預訓練模型(用於初始化網路權重),Faster R-CNN系統按3.2節所述進行fine-tune。這樣做在PASCAL VOC 2007測試集上可以達到78.8%的mAP。來自COCO的額外數據使mAP提高了5.6%。表6顯示,在PASCAL VOC 2007上,使用COCO+VOC訓練的模型在每個類別上都有最好的AP值。在PASCAL VOC 2012測試集上也觀察到類似的改進(表12和表7)。我們注意到,獲得這些出色結果的模型在測試時的速度仍然是每張影象200ms左右。
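下面以PyTorch給出一段假設性的示意程式碼(檔名與函式名均爲自擬),說明「用COCO檢測模型取代ImageNet預訓練模型來初始化、再在VOC上fine-tune」的大致做法:

```python
import torch

def init_from_coco(voc_model, coco_checkpoint_path="faster_rcnn_vgg16_coco.pth"):
    """用 COCO 訓練好的權重初始化 VOC 模型(20 類 + 背景)。
    分類 / 回歸輸出層因類別數不同(81 vs 21)形狀不符,保持原本的隨機初始化。"""
    coco_state = torch.load(coco_checkpoint_path, map_location="cpu")
    model_state = voc_model.state_dict()
    compatible = {k: v for k, v in coco_state.items()
                  if k in model_state and v.shape == model_state[k].shape}
    model_state.update(compatible)
    voc_model.load_state_dict(model_state)
    print("loaded %d / %d tensors from COCO checkpoint" % (len(compatible), len(model_state)))
    return voc_model

# 之後按第 3.2 節的流程在 VOC07+12 上 fine-tune 即可
```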
表12:使用不同訓練數據時,Faster R-CNN在PASCAL VOC 2007測試集和2012測試集上的檢測mAP(%)。模型是VGG-16。「COCO」表示使用COCO trainval數據集進行訓練。另見表6和表7。
We have presented RPNs for efficient and accurate region proposal generation. By sharing convolutional features with the down-stream detection network, the region proposal step is nearly cost-free. Our method enables a unified, deep-learning-based object detection system to run at near real-time frame rates. The learned RPN also improves region proposal quality and thus the overall object detection accuracy.
我們已經提出了RPN來生成高效、準確的region proposal。通過與下遊檢測網路共用折積特徵,region proposal步驟幾乎是零成本的。我們的方法使統一的、基於深度學習的目標檢測系統能夠以接近實時的影格率執行。學習到的RPN也提高了region proposal的品質,從而提高了整體的目標檢測精度。
[1] K. He, X. Zhang, S. Ren, and J. Sun, 「Spatial pyramid pooling in deep convolutional networks for visual recognition,」 in European Conference on Computer Vision (ECCV), 2014.
[2] R. Girshick, 「Fast R-CNN,」 in IEEE International Conference on Computer Vision (ICCV), 2015.
[3] K. Simonyan and A. Zisserman, 「Very deep convolutional networks for large-scale image recognition,」 in International Conference on Learning Representations (ICLR), 2015.
[4] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, 「Selective search for object recognition,」 International Journal of Computer Vision (IJCV), 2013.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, 「Rich feature hierarchies for accurate object detection and semantic segmentation,」 in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[6] C. L. Zitnick and P. Dollár, 「Edge boxes: Locating object proposals from edges,」 in European Conference on Computer Vision (ECCV), 2014.
[7] J. Long, E. Shelhamer, and T. Darrell, 「Fully convolutional networks for semantic segmentation,」 in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, 「Object detection with discriminatively trained part-based models,」 IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2010.
[9] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, 「Overfeat: Integrated recognition, localization and detection using convolutional networks,」 in International Conference on Learning Representations (ICLR), 2014.
[10] S. Ren, K. He, R. Girshick, and J. Sun, 「Faster R-CNN: Towards real-time object detection with region proposal networks,」 in Neural Information Processing Systems (NIPS), 2015.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, 「The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results,」 2007.
[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, 「Microsoft COCO: Common Objects in Context,」 in European Conference on Computer Vision (ECCV), 2014.
[13] S. Song and J. Xiao, 「Deep sliding shapes for amodal 3d object detection in rgb-d images,」 arXiv:1511.02300, 2015.
[14] J. Zhu, X. Chen, and A. L. Yuille, 「DeePM: A deep part-based model for object detection and semantic part localization,」 arXiv:1511.07131, 2015.
[15] J. Dai, K. He, and J. Sun, 「Instance-aware semantic segmentation via multi-task network cascades,」 arXiv:1512.04412, 2015.
[16] J. Johnson, A. Karpathy, and L. Fei-Fei, 「Densecap: Fully convolutional localization networks for dense captioning,」 arXiv:1511.07571, 2015.
[17] D. Kislyuk, Y. Liu, D. Liu, E. Tzeng, and Y. Jing, 「Human curation and convnets: Powering item-to-item recommendations on pinterest,」 arXiv:1511.04003, 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun, 「Deep residual learning for image recognition,」 arXiv:1512.03385, 2015.
[19] J. Hosang, R. Benenson, and B. Schiele, 「How good are detection proposals, really?」 in British Machine Vision Conference (BMVC), 2014.
[20] J. Hosang, R. Benenson, P. Dollár, and B. Schiele, 「What makes for effective detection proposals?」 IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015.
[21] N. Chavali, H. Agrawal, A. Mahendru, and D. Batra, 「Object-Proposal Evaluation Protocol is 'Gameable',」 arXiv:1505.05836, 2015.
[22] J. Carreira and C. Sminchisescu, 「CPMC: Automatic object segmentation using constrained parametric min-cuts,」 IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2012.
[23] P. Arbelaez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik, 「Multiscale combinatorial grouping,」 in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[24] B. Alexe, T. Deselaers, and V. Ferrari, 「Measuring the objectness of image windows,」 IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2012.
[25] C. Szegedy, A. Toshev, and D. Erhan, 「Deep neural networks for object detection,」 in Neural Information Processing Systems (NIPS), 2013.
[26] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, 「Scalable object detection using deep neural networks,」 in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[27] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov, 「Scalable, high-quality object detection,」 arXiv:1412.1441 (v1), 2015.
[28] P. O. Pinheiro, R. Collobert, and P. Dollár, 「Learning to segment object candidates,」 in Neural Information Processing Systems (NIPS), 2015.
[29] J. Dai, K. He, and J. Sun, 「Convolutional feature masking for joint object and stuff segmentation,」 in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[30] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun, 「Object detection networks on convolutional feature maps,」 arXiv:1504.06066, 2015.
[31] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, 「Attention-based models for speech recognition,」 in Neural Information Processing Systems (NIPS), 2015.
[32] M. D. Zeiler and R. Fergus, 「Visualizing and understanding convolutional neural networks,」 in European Conference on Computer Vision (ECCV), 2014.
[33] V. Nair and G. E. Hinton, 「Rectified linear units improve restricted boltzmann machines,」 in International Conference on Machine Learning (ICML), 2010.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, and A. Rabinovich, 「Going deeper with convolutions,」 in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[35] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, 「Backpropagation applied to handwritten zip code recognition,」 Neural computation, 1989.
[36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, 「ImageNet Large Scale Visual Recognition Challenge,」 in International Journal of Computer Vision (IJCV), 2015.
[37] A. Krizhevsky, I. Sutskever, and G. Hinton, 「Imagenet classification with deep convolutional neural networks,」 in Neural Information Processing Systems (NIPS), 2012.
[38] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, 「Caffe: Convolutional architecture for fast feature embedding,」 arXiv:1408.5093, 2014.
[39] K. Lenc and A. Vedaldi, 「R-CNN minus R,」 in British Machine Vision Conference (BMVC), 2015.