Since my research topic involves modeling human attention/gaze/saliency, I have recently been reading papers on saliency detection. I am using this blog partly as study notes and partly as a way to exchange ideas with others.
The first paper is Unified Image and Video Saliency Modeling (ECCV 2020), from a University of Oxford research team. Its biggest selling point is that it unifies saliency detection for static images and dynamic videos in a single model, a clear step beyond the traditional approach of predicting video saliency as a sequence of static-image predictions.
The paper opens by pointing out the current gap: image and video saliency models are developed independently, and it asks whether the two can be unified into one model.
Visual saliency modeling for images and videos is treated as two independent tasks in recent computer vision literature.
Can image and video saliency modeling be approached via a unified model, with mutual benefit?
The model is trained and tested on several public image and video saliency datasets:
Image datasets | Video datasets |
---|---|
SALICON and MIT300 | DHF1K, Hollywood-2 and UCF Sports |
The model introduces four new techniques: Domain-Adaptive Priors, Domain-Adaptive Fusion, Domain-Adaptive Smoothing, and a Bypass-RNN.
Results: the prediction mode can be switched between images and videos via domain-specific parameters, the model is lightweight in parameter count, and accuracy improves over previous models (impressive).
Paragraph 1 gives a general introduction to saliency prediction/modeling;
Paragraph 2 introduces dynamic video saliency datasets and models;
Paragraph 3 describes how static and dynamic saliency modeling have been kept separate, especially by networks that require optical flow or a fixed number of frames as input, which leads the authors to ask:
Is it possible to model static and dynamic saliency via one unified framework, with mutual benefit?
Paragraph 4: to handle the domain shift between image and video saliency data, the authors propose the UNISAL neural network architecture, which builds in domain adaptation techniques (nicely worded). The network is trained on DHF1K, Hollywood-2, UCF Sports and SALICON.
Paragraph 5: evaluating on the test splits of the four training datasets plus MIT300, UNISAL performs very well: it "outperforms current state-of-the-art methods on all video saliency datasets and achieves competitive performance for image saliency prediction".
The last paragraph summarizes the main contributions:
A history of saliency modeling, from Prof. Itti's bottom-up model to deep-learning-based top-down approaches, briefly covering several papers from recent years before moving on to dynamic video saliency.
Several traditional approaches: low-level visual statistics with additional temporal features (e.g., optical flow), and center-surround saliency in static scenes. Limitation: they are "limited by the representation ability of the low-level features for temporal information".
Deep-learning approaches: a multi-stream convolutional long short-term memory network (ConvLSTM); attention mechanisms combined with ConvLSTM; 3D convolutions. Limitation: "resulting in limited applicability to static scenes".
Earlier methods for extracting spatial and temporal features: spatio-temporal motion patterns, LSTMs over the spatio-temporal domain, and the phase spectrum of the Quaternion Fourier Transform. Limitation: "rendering the models unable to simultaneously model image saliency".
Now to the main part: how the model is built.
First, a domain shift analysis is carried out on DHF1K, Hollywood-2, UCF Sports and SALICON.
Method: randomly sample 256 frames from the four datasets, feed them into a pre-trained MobileNetV2 to obtain feature vectors, and then reduce the features to 2D with t-SNE for visualization.
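To make the procedure concrete, here is a minimal sketch of the analysis as I understand it, using torchvision's ImageNet-pretrained MobileNetV2 and scikit-learn's t-SNE; the `sample_frames` helper, the input size, and the pooling choice are my own placeholders, not from the paper:

```python
# Sketch: embed randomly sampled frames with a pre-trained MobileNetV2,
# then project the features to 2D with t-SNE (domain-shift analysis).
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.manifold import TSNE

# ImageNet-pretrained MobileNetV2 used as a fixed feature extractor
backbone = models.mobilenet_v2(pretrained=True).features.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(frames):  # frames: list of PIL images sampled from one dataset
    x = torch.stack([preprocess(f) for f in frames])
    feats = backbone(x)               # (N, 1280, H', W')
    return feats.mean(dim=[2, 3])     # global average pool -> (N, 1280)

# 'sample_frames' and 'datasets' are hypothetical helpers/lists:
# feats = torch.cat([embed(sample_frames(d, n=256)) for d in datasets])
# coords = TSNE(n_components=2, perplexity=30).fit_transform(feats.numpy())
```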
To reduce the differences in mean and variance between datasets, which helps neural network training, a standard normalization procedure is used:
Batch normalization (BN) aims to reduce the internal covariate shift of neural network activations by transforming their distribution to zero mean and unit variance for each training batch.
Two normalization strategies are compared:
As the comparison figure shows, the dataset-wise clustering of features is clearly reduced after normalization. The model therefore builds a dedicated BN module for each dataset.
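My reading of this per-dataset BN idea, as a small PyTorch sketch; the class name, channel count, and the "domain index" interface are my own, so check the official UNISAL code for the actual implementation:

```python
import torch
import torch.nn as nn

class DomainAdaptiveBN(nn.Module):
    """One BatchNorm2d per source dataset; the rest of the network is shared."""
    def __init__(self, num_channels, num_domains=4):
        super().__init__()
        self.bns = nn.ModuleList(
            nn.BatchNorm2d(num_channels) for _ in range(num_domains))

    def forward(self, x, domain):
        # 'domain' selects the dataset-specific statistics and affine parameters
        return self.bns[domain](x)

# Usage: the same convolutional features, normalized with dataset-0 statistics
# dabn = DomainAdaptiveBN(num_channels=256)
# y = dabn(torch.randn(8, 256, 32, 32), domain=0)
```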
The fixation maps of the different datasets show center bias to different degrees: it is strongest for Hollywood-2 and UCF Sports, while the image dataset SALICON shows a noticeably larger spread, probably because each SALICON image is viewed for 5 s. The authors therefore propose to "learn a separate set of Gaussian prior maps for each dataset".
The viewing conditions under which the fixation data were collected also differ between datasets:
Hollywood-2 and UCF Sports datasets are task-driven, the viewer is instructed to identify the main action shown. The DHF1K dataset contains free-viewing fixations.
Fusion layer (1x1 convolution)
Feature fusion experiment: build a simple saliency model that extracts features with MobileNetV2 and fuses them into a saliency map with a fusion layer (1x1 convolution), then compare the shared and the domain-adaptive variants of this layer. The results show that the domain-adaptive version clearly performs better.
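A rough sketch of the probe model used in this comparison, under my own assumptions (class name, frozen encoder, nearest-neighbor upsampling); setting `num_domains=1` gives the shared-fusion baseline, while `num_domains=4` gives the domain-adaptive variant:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class FusionProbe(nn.Module):
    """Toy saliency model: frozen MobileNetV2 features followed by a single
    1x1 'fusion' convolution that produces the saliency map."""
    def __init__(self, num_domains=1):
        super().__init__()
        self.encoder = models.mobilenet_v2(pretrained=True).features.eval()
        for p in self.encoder.parameters():
            p.requires_grad = False
        # One 1x1 fusion conv per domain (a single shared conv if num_domains == 1)
        self.fusion = nn.ModuleList(
            nn.Conv2d(1280, 1, kernel_size=1) for _ in range(num_domains))

    def forward(self, x, domain=0):
        feats = self.encoder(x)
        sal = self.fusion[domain](feats)
        return F.interpolate(sal, size=x.shape[-2:], mode='nearest')
```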
The ground-truth saliency maps of the different datasets are produced with different blurring filters. After resizing the data to the network input size and computing the sharpness (maximum gradient) distribution, large differences appear; DHF1K in particular has the highest sharpness. The authors therefore propose to "blur the network output with a different learned Smoothing kernel for each dataset".
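Here is how I would reproduce the sharpness statistic, reading "sharpness" as the maximum gradient magnitude of the ground-truth saliency map after resizing to the network input size; the resize target and the use of OpenCV/NumPy are my own choices:

```python
import numpy as np
import cv2

def sharpness(saliency_map, input_size=(224, 384)):
    """Maximum gradient magnitude of a ground-truth saliency map,
    computed after resizing to the network input resolution (H, W)."""
    sal = cv2.resize(saliency_map.astype(np.float32), input_size[::-1])
    sal /= sal.max() + 1e-8                  # normalize to [0, 1]
    gy, gx = np.gradient(sal)
    return np.sqrt(gx ** 2 + gy ** 2).max()

# Collecting this value over each dataset's maps gives the per-dataset
# sharpness distributions that the paper compares.
```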
Overall, the network is an encoder-RNN-decoder framework.
MobileNet-V2 was chosen as the backbone for three reasons:
Because the encoder uses ImageNet-pretrained parameters and a fixed input normalization, Domain-Adaptive Batch Normalization is not applied to the encoder.
The domain-adaptive Gaussian prior maps are 2D Gaussians with learnable parameters, scaled by a factor of 6, which is determined by the ReLU6 activations in MobileNetV2. The authors "propose unconstrained Gaussian prior maps, instead of drawing the initial Gaussian parameters from a normal distribution, which results in highly correlated maps". Figure (b) shows that x and y follow a normal distribution. The Gaussian prior maps are inserted between the CNN and the RNN "to model the static center bias" and "in order to leverage the prior maps in higher-level features".
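A sketch of how I picture the unconstrained, domain-adaptive Gaussian prior maps: each dataset gets its own set of maps with freely learnable centres and widths, and the output is scaled to the ReLU6 range. The exact parameterization and initialization below are my assumptions, not the paper's:

```python
import torch
import torch.nn as nn

class GaussianPriorMaps(nn.Module):
    """Per-dataset learnable Gaussian prior maps, scaled to [0, 6]."""
    def __init__(self, num_maps=16, num_domains=4):
        super().__init__()
        # Unconstrained (freely learnable) centres and log-widths per domain
        self.mu = nn.Parameter(torch.rand(num_domains, num_maps, 2))       # (x, y) in [0, 1]
        self.log_sigma = nn.Parameter(torch.zeros(num_domains, num_maps, 2) - 1.0)

    def forward(self, height, width, domain):
        ys = torch.linspace(0, 1, height)
        xs = torch.linspace(0, 1, width)
        yy, xx = torch.meshgrid(ys, xs, indexing='ij')          # (H, W)
        grid = torch.stack([xx, yy], dim=-1)                    # (H, W, 2)
        mu = self.mu[domain].view(-1, 1, 1, 2)                  # (M, 1, 1, 2)
        sigma = self.log_sigma[domain].exp().view(-1, 1, 1, 2)
        dist = ((grid - mu) / sigma) ** 2                       # (M, H, W, 2)
        maps = torch.exp(-0.5 * dist.sum(dim=-1))               # (M, H, W)
        return 6.0 * maps                                       # match the ReLU6 range
```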
Previous methods for capturing temporal information (RNNs, optical flow, or 3D convolutions) are not applicable to static images. The authors therefore propose the Bypass-RNN module, which adds temporal features through a residual connection that can be skipped:
RNN whose output is added to its input features via a residual connection that is automatically omitted (bypassed) for static batches during training and inference
The Bypass-RNN module is built with a convolutional GRU (cGRU).
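My sketch of the Bypass-RNN idea: a convolutional GRU whose output is added to its input through a residual connection, and which is skipped entirely (identity) for static image batches. The cGRU cell below is a simplified stand-in, not the paper's exact cell:

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell (simplified stand-in)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        p = kernel_size // 2
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=p)
        self.cand = nn.Conv2d(2 * channels, channels, kernel_size, padding=p)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde

class BypassRNN(nn.Module):
    """Residual cGRU that is bypassed entirely for static (image) batches."""
    def __init__(self, channels):
        super().__init__()
        self.cell = ConvGRUCell(channels)

    def forward(self, x, static=False):
        # x: (T, B, C, H, W); for image batches the RNN is simply skipped
        if static:
            return x
        h = torch.zeros_like(x[0])
        out = []
        for x_t in x:
            h = self.cell(x_t, h)
            out.append(x_t + h)      # residual connection around the RNN
        return torch.stack(out)
```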
The decoder details are as follows. To keep the parameter count low, it uses depthwise separable convolutions and pointwise 1x1 convolutions.
The Post-US2 features are reduced to a single channel by a Domain-Adaptive Fusion layer (1x1 convolution) and upsampled to the input resolution via nearest-neighbor interpolation.
The upsampling is followed by a Domain-Adaptive Smoothing layer with 41x41 convolutional kernels that explicitly model the dataset-dependent blurring of the ground-truth saliency maps.
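Putting the quoted pieces together, this is how I read the decoder head: domain-adaptive 1x1 fusion down to one channel, nearest-neighbor upsampling to the input resolution, then a domain-adaptive 41x41 smoothing convolution; channel counts and class names are my own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderHead(nn.Module):
    def __init__(self, in_ch, num_domains=4):
        super().__init__()
        # Domain-Adaptive Fusion: per-dataset 1x1 conv reducing features to one channel
        self.fusion = nn.ModuleList(
            nn.Conv2d(in_ch, 1, kernel_size=1) for _ in range(num_domains))
        # Domain-Adaptive Smoothing: per-dataset learned 41x41 blur of the output
        self.smoothing = nn.ModuleList(
            nn.Conv2d(1, 1, kernel_size=41, padding=20, bias=False)
            for _ in range(num_domains))

    def forward(self, feats, input_size, domain):
        sal = self.fusion[domain](feats)                           # (B, 1, h, w)
        sal = F.interpolate(sal, size=input_size, mode='nearest')  # upsample to input
        return self.smoothing[domain](sal)                         # dataset-specific blur
```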
The images in each dataset have a different aspect ratio; thanks to Domain-Adaptive Batch Normalization, the network can take a dataset-specific input resolution.
The images/frames of the training datasets each have different aspect ratios, specifically 4:3 for SALICON, 16:9 for DHF1K, 1.85:1 (median) for Hollywood-2, and 3:2 (median) for UCF Sports.
we use input resolutions of 288×384, 224×384, 224×416 and 256×384 for SALICON, DHF1K, Hollywood-2 and UCF Sports, respectively.
The video datasets have different frame rates, so every 5th or 4th frame is sampled, giving the network an effective input rate of 6 fps.
The frame rate of the DHF1K videos is 30 fps compared to 24 fps for Hollywood-2 and UCF Sports.
In order to assimilate the frame rates during training, and to train on longer time intervals, we construct clips using every 5th frame for DHF1K and every 4th frame for all others, yielding 6 fps overall.
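A small sketch of the clip construction as described (the clip length of 12 frames comes from the training details further down; how clips overlap or where they start is not specified, so that part is my guess):

```python
def make_clips(frames, dataset, clip_len=12):
    """Subsample a video's frame list so all datasets are seen at ~6 fps,
    then cut the result into fixed-length clips."""
    step = 5 if dataset == 'DHF1K' else 4      # 30/5 = 24/4 = 6 fps
    sub = frames[::step]
    return [sub[i:i + clip_len]
            for i in range(0, len(sub) - clip_len + 1, clip_len)]
```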
Dataset splits:
For SALICON, we use the official training/validation/testing split of 10,000/5,000/5,000.
For Hollywood-2 and UCF Sports, we use the training and testing splits of 823/884 and 103/47 videos, and the corresponding validation sets are randomly sampled 10% from the training sets, following prior work.
For Hollywood-2, the videos are divided into individual shots.
For the DHF1K dataset, we use the official training/validation/testing splits of 600/100/300 videos.
Training settings: video clips are 12 frames long, i.e. the past 2 s of input at 6 fps.
Stochastic Gradient Descent with momentum of 0.9 and weight decay of 10
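For reference, the corresponding PyTorch optimizer call; only the momentum of 0.9 is given explicitly above, so the learning rate and the weight-decay magnitude below are placeholders, not values taken from the paper:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=1)   # stand-in for the UNISAL network

# Momentum 0.9 as stated; lr and weight_decay are placeholder assumptions
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
```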