Since my research topic involves modeling human attention/gaze/saliency, I have recently been reading papers on saliency detection. I am using this blog partly as study notes and partly as a way to exchange ideas with others.
The first paper is Unified Image and Video Saliency Modeling (ECCV 2020), from a University of Oxford research team. Its biggest selling point is that it unifies saliency detection for static images and dynamic videos in a single model, a clear step beyond the traditional approach of predicting video saliency as a sequence of static-image predictions.
The paper opens by pointing out the current gap: image and video saliency models are developed independently, and it asks whether the two can be unified into one model.
Visual saliency modeling for images and videos is treated as two independent tasks in recent computer vision literature.
Can image and video saliency modeling be approached via a unified model, with mutual benefit?
The model is trained and tested on several public image and video saliency datasets:
Image datasets | Video datasets |
---|---|
SALICON and MIT300 | DHF1K, Hollywood-2 and UCF Sports |
The model introduces four new techniques: Domain-Adaptive Priors, Domain-Adaptive Fusion, Domain-Adaptive Smoothing, and a Bypass-RNN.
Results: the prediction mode can be switched between images and videos via domain-specific parameters, the model is lightweight in parameter count, and accuracy improves over previous models (impressive).
Paragraph 1 gives a general introduction to saliency prediction/modeling;
Paragraph 2 introduces dynamic video saliency datasets and models;
Paragraph 3 describes how static and dynamic saliency modeling have been kept separate, especially by networks that require optical flow or a fixed number of frames as input, which leads the authors to ask:
Is it possible to model static and dynamic saliency via one unified framework, with mutual benefit?
Paragraph 4: to handle the domain shift between image and video saliency data, the authors propose the UNISAL neural network architecture, which builds in domain adaptation techniques (nicely worded). The network is trained on DHF1K, Hollywood-2, UCF Sports and SALICON.
Paragraph 5: evaluating on the test splits of the four training datasets plus MIT300, UNISAL performs very well: it "outperforms current state-of-the-art methods on all video saliency datasets and achieves competitive performance for image saliency prediction".
The last paragraph summarizes the main contributions:
A history of saliency modeling, from Prof. Itti's bottom-up model to deep-learning-based top-down approaches, briefly covering several papers from recent years before moving on to dynamic video saliency.
Several traditional approaches: low-level visual statistics with additional temporal features (e.g., optical flow), and center-surround saliency in static scenes. Limitation: they are "limited by the representation ability of the low-level features for temporal information".
Deep-learning approaches: a multi-stream convolutional long short-term memory network (ConvLSTM); attention mechanisms combined with ConvLSTM; 3D convolutions. Limitation: "resulting in limited applicability to static scenes".
Earlier methods for extracting spatial and temporal features: spatio-temporal motion patterns, LSTMs over the spatio-temporal domain, and the phase spectrum of the Quaternion Fourier Transform. Limitation: "rendering the models unable to simultaneously model image saliency".
Now to the main part: how the model is built.
First, a domain shift analysis is carried out on DHF1K, Hollywood-2, UCF Sports and SALICON.
Method: randomly sample 256 frames from the four datasets, feed them into a pre-trained MobileNetV2 to obtain feature vectors, and then reduce the features to 2D with t-SNE for visualization.
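To make the procedure concrete, here is a minimal sketch of the analysis as I understand it, using torchvision's ImageNet-pretrained MobileNetV2 and scikit-learn's t-SNE; the `sample_frames` helper, the input size, and the pooling choice are my own placeholders, not from the paper:

```python
# Sketch: embed randomly sampled frames with a pre-trained MobileNetV2,
# then project the features to 2D with t-SNE (domain-shift analysis).
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.manifold import TSNE

# ImageNet-pretrained MobileNetV2 used as a fixed feature extractor
backbone = models.mobilenet_v2(pretrained=True).features.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(frames):  # frames: list of PIL images sampled from one dataset
    x = torch.stack([preprocess(f) for f in frames])
    feats = backbone(x)               # (N, 1280, H', W')
    return feats.mean(dim=[2, 3])     # global average pool -> (N, 1280)

# 'sample_frames' and 'datasets' are hypothetical helpers/lists:
# feats = torch.cat([embed(sample_frames(d, n=256)) for d in datasets])
# coords = TSNE(n_components=2, perplexity=30).fit_transform(feats.numpy())
```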
To reduce the differences in mean and variance between datasets, which helps neural network training, a standard normalization procedure is used:
Batch normalization (BN) aims to reduce the internal covariate shift of neural network activations by transforming their distribution to zero mean and unit variance for each training batch.
Two normalization strategies are compared:
As the comparison figure shows, the dataset-wise clustering of features is clearly reduced after normalization. The model therefore builds a dedicated BN module for each dataset.
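My reading of this per-dataset BN idea, as a small PyTorch sketch; the class name, channel count, and the "domain index" interface are my own, so check the official UNISAL code for the actual implementation:

```python
import torch
import torch.nn as nn

class DomainAdaptiveBN(nn.Module):
    """One BatchNorm2d per source dataset; the rest of the network is shared."""
    def __init__(self, num_channels, num_domains=4):
        super().__init__()
        self.bns = nn.ModuleList(
            nn.BatchNorm2d(num_channels) for _ in range(num_domains))

    def forward(self, x, domain):
        # 'domain' selects the dataset-specific statistics and affine parameters
        return self.bns[domain](x)

# Usage: the same convolutional features, normalized with dataset-0 statistics
# dabn = DomainAdaptiveBN(num_channels=256)
# y = dabn(torch.randn(8, 256, 32, 32), domain=0)
```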
The fixation maps of the different datasets show center bias to different degrees: it is strongest for Hollywood-2 and UCF Sports, while the image dataset SALICON shows a noticeably larger spread, probably because each SALICON image is viewed for 5 s. The authors therefore propose to "learn a separate set of Gaussian prior maps for each dataset".
The viewing conditions under which the fixation data were collected also differ between datasets:
Hollywood-2 and UCF Sports datasets are task-driven, the viewer is instructed to identify the main action shown. The DHF1K dataset contains free-viewing fixations.
Fusion layer (1x1 convolution)
Feature fusion experiment: build a simple saliency model that extracts features with MobileNetV2 and fuses them into a saliency map with a fusion layer (1x1 convolution), then compare the shared and the domain-adaptive variants of this layer. The results show that the domain-adaptive version clearly performs better.
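A rough sketch of the probe model used in this comparison, under my own assumptions (class name, frozen encoder, nearest-neighbor upsampling); setting `num_domains=1` gives the shared-fusion baseline, while `num_domains=4` gives the domain-adaptive variant:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class FusionProbe(nn.Module):
    """Toy saliency model: frozen MobileNetV2 features followed by a single
    1x1 'fusion' convolution that produces the saliency map."""
    def __init__(self, num_domains=1):
        super().__init__()
        self.encoder = models.mobilenet_v2(pretrained=True).features.eval()
        for p in self.encoder.parameters():
            p.requires_grad = False
        # One 1x1 fusion conv per domain (a single shared conv if num_domains == 1)
        self.fusion = nn.ModuleList(
            nn.Conv2d(1280, 1, kernel_size=1) for _ in range(num_domains))

    def forward(self, x, domain=0):
        feats = self.encoder(x)
        sal = self.fusion[domain](feats)
        return F.interpolate(sal, size=x.shape[-2:], mode='nearest')
```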
The ground-truth saliency maps of the different datasets are produced with different blurring filters. After resizing the data to the network input size and computing the sharpness (maximum gradient) distribution, large differences appear; DHF1K in particular has the highest sharpness. The authors therefore propose to "blur the network output with a different learned Smoothing kernel for each dataset".
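Here is how I would reproduce the sharpness statistic, reading "sharpness" as the maximum gradient magnitude of the ground-truth saliency map after resizing to the network input size; the resize target and the use of OpenCV/NumPy are my own choices:

```python
import numpy as np
import cv2

def sharpness(saliency_map, input_size=(224, 384)):
    """Maximum gradient magnitude of a ground-truth saliency map,
    computed after resizing to the network input resolution (H, W)."""
    sal = cv2.resize(saliency_map.astype(np.float32), input_size[::-1])
    sal /= sal.max() + 1e-8                  # normalize to [0, 1]
    gy, gx = np.gradient(sal)
    return np.sqrt(gx ** 2 + gy ** 2).max()

# Collecting this value over each dataset's maps gives the per-dataset
# sharpness distributions that the paper compares.
```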
Overall, the network is an encoder-RNN-decoder framework.
MobileNet-V2 was chosen as the backbone for three reasons:
Because the encoder uses ImageNet-pretrained parameters and a fixed input normalization, Domain-Adaptive Batch Normalization is not applied to the encoder.
The domain-adaptive Gaussian prior maps are 2D Gaussians with learnable parameters, scaled by a factor of 6, which is determined by the ReLU6 activations in MobileNetV2. The authors "propose unconstrained Gaussian prior maps, instead of drawing the initial Gaussian parameters from a normal distribution, which results in highly correlated maps". Figure (b) shows that x and y follow a normal distribution. The Gaussian prior maps are inserted between the CNN and the RNN "to model the static center bias" and "in order to leverage the prior maps in higher-level features".
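A sketch of how I picture the unconstrained, domain-adaptive Gaussian prior maps: each dataset gets its own set of maps with freely learnable centres and widths, and the output is scaled to the ReLU6 range. The exact parameterization and initialization below are my assumptions, not the paper's:

```python
import torch
import torch.nn as nn

class GaussianPriorMaps(nn.Module):
    """Per-dataset learnable Gaussian prior maps, scaled to [0, 6]."""
    def __init__(self, num_maps=16, num_domains=4):
        super().__init__()
        # Unconstrained (freely learnable) centres and log-widths per domain
        self.mu = nn.Parameter(torch.rand(num_domains, num_maps, 2))       # (x, y) in [0, 1]
        self.log_sigma = nn.Parameter(torch.zeros(num_domains, num_maps, 2) - 1.0)

    def forward(self, height, width, domain):
        ys = torch.linspace(0, 1, height)
        xs = torch.linspace(0, 1, width)
        yy, xx = torch.meshgrid(ys, xs, indexing='ij')          # (H, W)
        grid = torch.stack([xx, yy], dim=-1)                    # (H, W, 2)
        mu = self.mu[domain].view(-1, 1, 1, 2)                  # (M, 1, 1, 2)
        sigma = self.log_sigma[domain].exp().view(-1, 1, 1, 2)
        dist = ((grid - mu) / sigma) ** 2                       # (M, H, W, 2)
        maps = torch.exp(-0.5 * dist.sum(dim=-1))               # (M, H, W)
        return 6.0 * maps                                       # match the ReLU6 range
```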
Previous methods for capturing temporal information (RNNs, optical flow, or 3D convolutions) are not applicable to static images. The authors therefore propose the Bypass-RNN module, which adds temporal features through a residual connection that can be skipped:
RNN whose output is added to its input features via a residual connection that is automatically omitted (bypassed) for static batches during training and inference
The Bypass-RNN module is built with a convolutional GRU (cGRU).
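My sketch of the Bypass-RNN idea: a convolutional GRU whose output is added to its input through a residual connection, and which is skipped entirely (identity) for static image batches. The cGRU cell below is a simplified stand-in, not the paper's exact cell:

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell (simplified stand-in)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        p = kernel_size // 2
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=p)
        self.cand = nn.Conv2d(2 * channels, channels, kernel_size, padding=p)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde

class BypassRNN(nn.Module):
    """Residual cGRU that is bypassed entirely for static (image) batches."""
    def __init__(self, channels):
        super().__init__()
        self.cell = ConvGRUCell(channels)

    def forward(self, x, static=False):
        # x: (T, B, C, H, W); for image batches the RNN is simply skipped
        if static:
            return x
        h = torch.zeros_like(x[0])
        out = []
        for x_t in x:
            h = self.cell(x_t, h)
            out.append(x_t + h)      # residual connection around the RNN
        return torch.stack(out)
```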
The decoder details are as follows. To keep the parameter count low, it uses depthwise separable convolutions and pointwise 1x1 convolutions.
The Post-US2 features are reduced to a single channel by a Domain-Adaptive Fusion layer (1x1 convolution) and upsampled to the input resolution via nearest-neighbor interpolation.
The upsampling is followed by a Domain-Adaptive Smoothing layer with 41x41 convolutional kernels that explicitly model the dataset-dependent blurring of the ground-truth saliency maps.
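Putting the quoted pieces together, this is how I read the decoder head: domain-adaptive 1x1 fusion down to one channel, nearest-neighbor upsampling to the input resolution, then a domain-adaptive 41x41 smoothing convolution; channel counts and class names are my own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderHead(nn.Module):
    def __init__(self, in_ch, num_domains=4):
        super().__init__()
        # Domain-Adaptive Fusion: per-dataset 1x1 conv reducing features to one channel
        self.fusion = nn.ModuleList(
            nn.Conv2d(in_ch, 1, kernel_size=1) for _ in range(num_domains))
        # Domain-Adaptive Smoothing: per-dataset learned 41x41 blur of the output
        self.smoothing = nn.ModuleList(
            nn.Conv2d(1, 1, kernel_size=41, padding=20, bias=False)
            for _ in range(num_domains))

    def forward(self, feats, input_size, domain):
        sal = self.fusion[domain](feats)                           # (B, 1, h, w)
        sal = F.interpolate(sal, size=input_size, mode='nearest')  # upsample to input
        return self.smoothing[domain](sal)                         # dataset-specific blur
```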
The images in each dataset have a different aspect ratio; thanks to Domain-Adaptive Batch Normalization, the network can take a dataset-specific input resolution.
The images/frames of the training datasets each have different aspect ratios, specifically 4:3 for SALICON, 16:9 for DHF1K, 1.85:1 (median) for Hollywood-2, and 3:2 (median) for UCF Sports.
we use input resolutions of 288×384, 224×384, 224×416 and 256×384 for SALICON, DHF1K, Hollywood-2 and UCF Sports, respectively.
The video datasets have different frame rates, so every 5th or 4th frame is sampled, giving the network an effective input rate of 6 fps.
The frame rate of the DHF1K videos is 30 fps compared to 24 fps for Hollywood-2 and UCF Sports.
In order to assimilate the frame rates during training, and to train on longer time intervals, we construct clips using every 5th frame for DHF1K and every 4th frame for all others, yielding 6 fps overall.
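A small sketch of the clip construction as described (the clip length of 12 frames comes from the training details further down; how clips overlap or where they start is not specified, so that part is my guess):

```python
def make_clips(frames, dataset, clip_len=12):
    """Subsample a video's frame list so all datasets are seen at ~6 fps,
    then cut the result into fixed-length clips."""
    step = 5 if dataset == 'DHF1K' else 4      # 30/5 = 24/4 = 6 fps
    sub = frames[::step]
    return [sub[i:i + clip_len]
            for i in range(0, len(sub) - clip_len + 1, clip_len)]
```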
Dataset splits:
For SALICON, we use the official training/validation/testing split of 10,000/5,000/5,000.
For Hollywood-2 and UCF Sports, we use the training and testing splits of 823/884 and 103/47 videos, and the corresponding validation sets are randomly sampled 10% from the training sets, following prior work.
For Hollywood-2, the videos are divided into individual shots.
For the DHF1K dataset, we use the official training/validation/testing splits of 600/100/300 videos.
Training settings: video clips are 12 frames long, i.e. the past 2 s of input at 6 fps.
Stochastic Gradient Descent with momentum of 0.9 and weight decay of 10
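For reference, the corresponding PyTorch optimizer call; only the momentum of 0.9 is given explicitly above, so the learning rate and the weight-decay magnitude below are placeholders, not values taken from the paper:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=1)   # stand-in for the UNISAL network

# Momentum 0.9 as stated; lr and weight_decay are placeholder assumptions
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
```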