Paper information

Title: MFAN: Multi-modal Feature-enhanced Attention Networks for Rumor Detection
Authors: Jiaqi Zheng, Xi Zhang, Sanchuan Guo, Quan Wang, Wenyu Zang, Yongdong Zhang
Venue: IJCAI 2022
Paper: download
Code: download

Abstract

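　　MFAN, the model proposed in this paper, is the first to integrate textual, visual, and social graph features in a single framework. It also considers both the complementarity and the alignment between different modalities to achieve better fusion.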

1 Introduction

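　　Traditional rumor detection models rely mainly on extracting textual features as the source-post representation and then classifying it. Several works show that fusing textual and visual features outperforms using text alone [Khattar et al., 2019; Wang et al., 2018; Zhou et al., 2020]. Their drawback is that they do not simultaneously consider graphical social contexts, which have been shown to be beneficial [Yuan et al., 2019].

　　The social context of a source post usually involves the users who forward it and the corresponding comments. Based on these entities and their connections, a heterogeneous graph can be built to model structural information, and GNN models such as GAT and GCN can then be applied. These models have the following problems: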

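The quality of the learned node representations depends heavily on reliable links between entities. Because of privacy issues or data-crawling constraints, the available social graph data is likely to miss some important links between entities, so potential links need to be added to the social graph for more accurate detection;
Neighboring nodes on the graph may be related in various latent ways, and the neighborhood aggregation of a conventional graph neural network (GNN) may fail to distinguish their effects on the target node representation, which hurts performance;
How to effectively integrate the learned social graph features with features from other modalities (such as visual features) is little explored in existing research.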

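　　Specifically, a self-supervised loss is introduced to align the source-post representations learned from two different views, the text-visual view and the social graph view, with the aim of improving representation learning in each view. On the one hand, we propose to infer latent links between nodes in the social graph to alleviate the incomplete-link problem. On the other hand, we use a signed attention mechanism to capture both positive and negative neighborhood correlations for better node representations. With this enhanced cross-modal fusion and social graph representation learning, the performance of multimedia rumor detection can be improved.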

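　　Contributions:

A multi-modal feature-enhanced attention network is proposed for multimedia rumor detection; it effectively combines textual, visual, and social graph features in a single framework.
A self-supervised loss is introduced to align source-post representations across different views for better multi-modal fusion.
Social graph feature learning is improved by enhancing both the graph topology and the neighborhood aggregation process.
Experiments show that the model identifies rumors effectively and outperforms state-of-the-art baselines on two large-scale real-world datasets.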

2 Related Work

　　Comparison with related work:

3 Problem Definition

　　Let $P=\left\{p_{1}, p_{2}, \cdots, p_{n}\right\}$ be a set of multimedia posts on social media with both texts and images. For each post $p_{i} \in P$ , $p_{i}=\left\{t_{i}, v_{i}, u_{i}, c_{i}\right\}$ , where $t_{i}$, $v_{i}$, and $u_{i}$ denote the text, the image, and the user who published the post, respectively. $c_{i}=\left\{c_{i}^{1}, c_{i}^{2}, \cdots, c_{i}^{j}\right\}$ represents the set of comments of $p_{i}$ . Moreover, each comment $c_{i}^{j}$ is posted by a corresponding user $u_{i}^{j}$ .

　　In order to represent user behaviors on social media, we establish a graph $G=\{V, A, E\}$ , where $V$ is a set of nodes, including user nodes, comment nodes, and post nodes. $A \in\{0,1\}^{|V| \times|V|}$ is an adjacency matrix between nodes to describe the relationships between nodes, including posting, commenting, and forwarding. $E$ is the set of edges.

　　We define rumor detection as a binary classification task. $y \in\{0,1\}$ denotes class labels, where $y=1$ indicates rumor, and $y=0$ otherwise. Our goal is to learn the function $F\left(p_{i}\right)=y$ to predict the label of a given post $p_{i}$ .
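　　The definitions above map naturally onto a small data structure. The following is only a sketch under stated assumptions: the class and field names (`Post`, `Comment`, `image_path`), the node naming scheme, and the choice of an undirected graph are illustrative and not taken from the paper, and forwarding edges are omitted because the post tuple above does not list forwarding users.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    text: str   # content of c_i^j
    user: str   # u_i^j, the user who wrote the comment


@dataclass
class Post:
    text: str                      # t_i
    image_path: str                # v_i (hypothetical: stored as a file path)
    user: str                      # u_i
    comments: List[Comment] = field(default_factory=list)  # c_i
    label: int = 0                 # y: 1 = rumor, 0 = non-rumor


def build_adjacency(posts):
    """Return (node_names, A) over post, user and comment nodes."""
    nodes, index, edges = [], {}, []

    def node_id(name):
        # Assign a new row/column index the first time a node is seen.
        if name not in index:
            index[name] = len(nodes)
            nodes.append(name)
        return index[name]

    for i, post in enumerate(posts):
        p_id = node_id(f"post:{i}")
        edges.append((node_id(f"user:{post.user}"), p_id))         # posting
        for j, comment in enumerate(post.comments):
            c_id = node_id(f"comment:{i}:{j}")
            edges.append((p_id, c_id))                             # comment on post
            edges.append((node_id(f"user:{comment.user}"), c_id))  # commenting

    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    for a, b in edges:
        A[a][b] = A[b][a] = 1      # assumption: undirected edges
    return nodes, A


posts = [Post("breaking news", "img_0.jpg", "alice",
              [Comment("is this real?", "bob")], label=1)]
nodes, A = build_adjacency(posts)
print(nodes)
```

　　Every entry of `A` is 0 or 1, matching $A \in\{0,1\}^{|V| \times|V|}$; a directed or typed-edge variant would need a different representation.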

4 Methodology

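　　The focus of our proposal is to effectively combine textual, visual, and social graph features to improve rumor detection. To this end, we first extract these three types of features. To produce better social graph features, we enhance the graph topology and the aggregation process on top of GAT. We then capture cross-modal interactions and alignment to achieve better multi-modal fusion. Finally, we concatenate the enhanced multi-modal features for classification. We also apply adversarial training to improve robustness. The overall architecture is shown in Figure 1.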

4.1 Textual and Visual Feature Extractor

Textual Representations

　　For each post $p_{i}$, its text content $t_{i}$ is padded or truncated to a token sequence of the same length $L$, which can be written as:

　　　　$\mathcal{O}_{1: L}^{i}=\left\{o_{1}^{i}, o_{2}^{i}, \cdots, o_{L}^{i}\right\} \quad\quad\quad(1)$

　　where $o \in \mathbb{R}^{d}$ is a $d$-dimensional word embedding and $o_{j}^{i}$ is the embedding of the $j$-th word of $t_{i}$. [One-hot]

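　　A convolutional layer (CNN) is applied to the word embedding matrix: each window $\mathcal{O}_{j: j+k-1}^{i}$ yields a feature $s_{i j}$, where $k$ is the size of the receptive field. The complete feature map is $s^{i}= \left\{s_{i 1}, s_{i 2}, \cdots, s_{i(L-k+1)}\right\}$, and max-pooling over $s^{i}$ gives $\hat{s}^{i}=\max \left(s^{i}\right)$. Kernels of different sizes $k \in\{3,4,5\}$ are used to obtain semantic features at different granularities. Finally, we concatenate the outputs of all filters to form the overall textual feature vector of $t_i$: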

　　　　$R_{t}^{i}=\operatorname{concat}\left(\hat{s}_{k=3}^{i}, \hat{s}_{k=4}^{i}, \hat{s}_{k=5}^{i}\right) \quad\quad\quad(2)$
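　　The convolution, max-pooling, and concatenation steps can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: it uses one randomly initialized filter per kernel size, a ReLU non-linearity, and hypothetical helper names (`conv_max_pool`, `text_features`), none of which are specified in the text above.

```python
import random


def conv_max_pool(tokens, kernel, bias=0.0):
    """Slide one k x d filter over an L x d token matrix, then max-pool."""
    k, length = len(kernel), len(tokens)
    feature_map = []
    for j in range(length - k + 1):          # positions 1 .. L-k+1
        window = tokens[j:j + k]             # O_{j:j+k-1}
        s = sum(w * x
                for w_row, x_row in zip(kernel, window)
                for w, x in zip(w_row, x_row)) + bias
        feature_map.append(max(0.0, s))      # ReLU (assumed non-linearity)
    return max(feature_map)                  # pooled feature \hat{s}


def text_features(tokens, kernels):
    """Concatenate the pooled output of every filter, as in Eq. (2)."""
    return [conv_max_pool(tokens, kernel) for kernel in kernels]


rng = random.Random(0)
L, d = 10, 4                                  # toy sequence length and embedding size
tokens = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(L)]
kernels = [[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(k)]
           for k in (3, 4, 5)]                # one filter per kernel size
R_t = text_features(tokens, kernels)
print(len(R_t))
```

　　With several filters per kernel size, `R_t` would simply grow to one entry per filter.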

Visual Representations

　　The pre-trained $\operatorname{ResNet} 50$ is used to obtain the feature embedding $V_{r}^{i}$ of the image $v_i$ in the post, which is then fed into a fully connected layer:

　　　　$R_{v}^{i}=\sigma\left(W_{v} * V_{r}^{i}\right) \quad\quad\quad(3)$

4.2 Enhanced Social Graph Feature Learning

Inferring Hidden Links

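　　To alleviate the missing-link problem, we propose to infer hidden links between nodes in the social network. Specifically, the node embeddings are arranged as a matrix $X \in \mathbb{R}^{|V| \times d}$, where $d$ is the embedding dimension. $X$ contains three types of nodes: sentence vectors are used as the initial embeddings of post and comment nodes, and the initial embedding of a user is the average of the embeddings of the post nodes that user published.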

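　　In short, the initial rows of $X$ are built as follows:

A user can publish multiple posts.
Posts and comments both contain text, and the sentence vector of that text is used as the initial embedding of the post or comment (a post may contain several sentences).
The initial embedding of a user is the average of the embeddings of that user's posts.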

　　The correlation $\beta_{i j}$ between nodes $n_{i}$ and $n_{j}$ is then computed as their cosine similarity:

　　　　$\beta_{i j}=\frac{x_{i} \cdot x_{j}}{\left\|x_{i}\right\|\left\|x_{j}\right\|} \quad\quad\quad(4)$

　　where $x_{i}$ and $x_j$ are the node embeddings of $n_i$ and $n_j$. If the similarity is at least $0.5$, we infer that a potential edge exists between them:

　　　　$e_{i j}=\left\{\begin{array}{l}0, \text { if } \beta_{i j}<0.5 \\1, \text { otherwise }\end{array}\right. \quad\quad\quad(5)$

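　　The inferred potential edges are then used to enhance the original adjacency matrix $A \in \mathbb{R}^{|V| \times|V|}$. Let $a_{i j}$ denote an element of $A$, where $a_{i j}=1$ means there is an edge between $n_{i}$ and $n_{j}$ and $a_{i j}=0$ otherwise. The element $a_{i j}^{\prime}$ of the enhanced adjacency matrix $A^{\prime}$ is defined as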

　　　　$a_{i j}^{\prime}=\left\{\begin{array}{l}0, \text { if } e_{i j}=0 \text { and } a_{i j}=0 \\1, \text { otherwise }\end{array}\right. \quad\quad\quad(6)$
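　　Eqs. (4) to (6) can be sketched in a few lines of plain Python. The function names and the toy embeddings below are assumptions made for illustration; the threshold of 0.5 comes from Eq. (5).

```python
import math


def cosine(x, y):
    """Cosine similarity beta_ij of Eq. (4); returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def enhance_adjacency(X, A, threshold=0.5):
    """Add an edge wherever beta_ij >= threshold, keeping all original edges."""
    n = len(X)
    A_new = [row[:] for row in A]            # copy so A itself is not modified
    for i in range(n):
        for j in range(n):
            # Eq. (5): e_ij = 1 unless beta_ij < threshold.
            # Eq. (6): a'_ij = 1 if e_ij = 1 or a_ij = 1.
            if cosine(X[i], X[j]) >= threshold:
                A_new[i][j] = 1
    return A_new


# Toy example: nodes 0 and 1 are similar but unlinked; nodes 0 and 2 are linked.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
A = [[0, 0, 1],
     [0, 0, 0],
     [1, 0, 0]]
A_prime = enhance_adjacency(X, A)
print(A_prime)
```

　　Applied literally, Eq. (5) also sets the diagonal to 1 because every non-zero vector has similarity 1 with itself; whether self-loops are wanted is a design choice the text above does not settle.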

Capturing Multi-aspect Neighborhood Relations

　　GAT is used to compute the attention coefficients between a node and its neighbors:

　　　　$\mathcal{E}_{i}=\left\{e_{i 1}^{\prime}, e_{i 2}^{\prime}, \cdots, e_{i\left|\mathcal{N}_{i}\right|}^{\prime}\right\}$

　　where the attention between nodes $n_i$ and $n_j$ is:

　　　　$e_{i j}^{\prime}=\operatorname{LeakyReLU}\left(\hat{a}\left[W x_{i} \| W x_{j}^{\prime}\right]\right)\quad\quad\quad(7)$

　　A problem with this attention mechanism: before softmax, the attention coefficients may contain large negative weights, for example:

　　　　$\mathcal{E}_{t}=\{0.7,0.3,-0.1,-0.9\}$

　　After softmax, the attention weights become:

　　　　$\mathcal{E}_{t}^{\prime}=\{0.43,0.29,0.19,0.09\}$

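　　However, "-0.9" may indicate that the two node vectors point in opposite directions, yet softmax turns it into a small positive weight. Such a negative correlation is clearly useful for rumor detection, for instance when a person posts a comment that is inconsistent with their behavior.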

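　　Inspired by QSAN, the paper uses a signed attention mechanism, Signed GAT. Specifically, for node $n_{i}$, the negation of the attention weights $\mathcal{E}_{i}$ of its neighbors is denoted $\tilde{\mathcal{E}}_{i}=-\mathcal{E}_{i}$. The normalized weights of $\mathcal{E}_{i}$ and $\tilde{\mathcal{E}}_{i}$ are then computed with the softmax function: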

　　　　$\begin{aligned}\mathcal{E}_{i}^{\prime} &=\operatorname{softmax}\left(\mathcal{E}_{i}\right) \\\tilde{\mathcal{E}}_{i}^{\prime} &=\operatorname{softmax}\left(\tilde{\mathcal{E}}_{i}\right)\end{aligned}\quad\quad\quad(8)$

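　　To capture both positive and negative relations between nodes, $\mathcal{E}_{i}^{\prime}$ and $-\tilde{\mathcal{E}}_{i}^{\prime}$ are used separately to obtain weighted sums of the neighbor features. The two vectors are then concatenated and passed through a fully connected layer to obtain the final node feature. For example, the node feature of $n_{i}$ is computed as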

　　　　$\hat{x}_{i}=\sigma\left(W_{n} *\left(\mathcal{E}_{i}^{\prime} * X_{j} \|-\tilde{\mathcal{E}}_{i}^{\prime} * X_{j}\right)\right)\quad\quad\quad(9)$

　　Example:

import torch
import torch.nn.functional as F

if __name__ == "__main__":
    # Raw (unnormalized) attention scores of one node's four neighbors.
    data = torch.tensor([0.2, -1, 1, -0.1])
    out = F.softmax(data, dim=-1)
    print("data = ", data.numpy())
    print("softmax data = ", out.numpy())

    # Negated scores: the most negative raw score now gets the largest weight.
    data = torch.tensor([0.2, -1, 1, -0.1]) * -1
    out = F.softmax(data, dim=-1)
    print("-data = ", data.numpy())
    print("softmax data = ", out.numpy())

　　Output:

data =  [ 0.2 -1.   1.  -0.1]
softmax data =  [0.2343263  0.07057773 0.5215028  0.1735932 ]
-data =  [-0.2  1.  -1.   0.1]
softmax data =  [0.16341725 0.5425644  0.0734281  0.22059022]
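　　Building on the two softmax passes above, the signed aggregation of Eq. (9) can be sketched as follows. This plain-Python version is an illustration only: it reuses the raw scores from the example, uses made-up 2-dimensional neighbor features, and leaves out the fully connected layer $W_{n}$ and the activation $\sigma$, so it returns just the concatenated positive and negative weighted sums.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def signed_aggregate(scores, neighbors):
    """Concatenate the positive and negative weighted sums of neighbor features."""
    pos_w = softmax(scores)                                # E'_i
    neg_w = [-w for w in softmax([-s for s in scores])]    # -E~'_i
    dim = len(neighbors[0])
    pos = [sum(w * x[k] for w, x in zip(pos_w, neighbors)) for k in range(dim)]
    neg = [sum(w * x[k] for w, x in zip(neg_w, neighbors)) for k in range(dim)]
    return pos + neg                                       # concatenation (||)


scores = [0.2, -1.0, 1.0, -0.1]          # raw attention scores from the example above
neighbors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # hypothetical features
x_hat = signed_aggregate(scores, neighbors)
print(x_hat)
```

　　The second half of the output is dominated by the neighbor with the most negative raw score, which is exactly the information a single softmax would suppress.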

Graph Feature Extractor

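　　Signed GAT is then used to extract structural features from the enhanced graph. Each node's embedding is updated according to $\text{Eq.9}$, giving the updated node embedding matrix $\hat{X} \in \mathbb{R}^{|V| \times d}$, where $|V|$ is the number of nodes and $d$ is the embedding dimension. A multi-head attention mechanism is then adopted to capture features from different perspectives, and the updated node embeddings of all heads are concatenated as the overall graph feature: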

　　　　$\hat{G}=\|_{h=1}^{H} \sigma\left(\hat{X}_{h}\right)\quad\quad\quad(10)$

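　　where $H$ is the number of heads. The graph feature $R_{g}^{i}$ of the $i$-th post then corresponds to the $i$-th row of $\hat{G}$ (each row of $\hat{X}$ holds one node).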

4.3 Multi-modal Feature Fusing

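　　Since there are three types of data in this work, we adopt a hierarchical fusion scheme based on the co-attention method [Lu et al., 2019]. To capture different aspects of the cross-modal relations and enhance the multi-modal features, we propose to enforce cross-modal alignment under a self-supervised loss.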

Cross-modal Co-attention Mechanism

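　　For each modality, a multi-head self-attention mechanism is first used to enhance the intra-modal feature representation. For the textual feature $R_{t}^{i}$, for example, we compute $Q_{t}^{i}=R_{t}^{i} W_{t}^{Q}$, $K_{t}^{i}=R_{t}^{i} W_{t}^{K}$, and $V_{t}^{i}=R_{t}^{i} W_{t}^{V}$ (where $W_{t}^{Q}, W_{t}^{K}, W_{t}^{V} \in \mathbb{R}^{d \times \frac{d}{H}}$ and $H$ is the number of heads). The multi-head self-attention feature of the textual modality is then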

　　　　${\large Z_{t}^{i}=\left(\|_{h=1}^{H} \operatorname{softmax}\left(\frac{Q_{t}^{i} K_{t}^{i^{T}}}{\sqrt{d}}\right) V_{t}^{i}\right) W_{t}^{O}} \quad\quad\quad(11)$

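　　The same multi-head attention is applied to the image feature $R_{v}^{i}$ and the graph feature $R_{g}^{i}$ to obtain their representations $Z_{v}^{i}$ and $Z_{g}^{i}$.

　　Cross-attention is then applied between the textual and visual features, with queries from the visual modality and keys and values from the textual modality, to obtain the fused feature: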

　　　　$Z_{v t}^{i}=\left(\|_{h=1}^{H} \operatorname{softmax}\left(\frac{Q_{v}^{i} K_{t}^{i^{T}}}{\sqrt{d}}\right) V_{t}^{i}\right) W_{v t}^{O}\quad\quad\quad(12)$

　　Note: $Z_{v t}^{i}$ is the fused text-visual feature; the visual-text feature $Z_{t v}^{i}$ is obtained in the same way.
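　　The core of Eq. (12) is scaled dot-product attention with queries from one modality and keys and values from the other. The sketch below shows a single head in plain Python and makes several simplifying assumptions: the projection matrices $W^{Q}$, $W^{K}$, $W^{V}$ and the output projection $W_{v t}^{O}$ are omitted (treated as identity), there is no multi-head concatenation, and the toy matrices are invented for illustration.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


def softmax_rows(M):
    """Apply a numerically stable softmax to every row of M."""
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for a single head."""
    d = len(Q[0])
    scores = [[v / math.sqrt(d) for v in row] for row in matmul(Q, transpose(K))]
    return matmul(softmax_rows(scores), V)


# Queries from the visual modality, keys and values from the textual modality.
Q_v = [[1.0, 0.0], [0.0, 1.0]]
K_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V_t = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Z_vt = cross_attention(Q_v, K_t, V_t)
print(Z_vt)
```

　　Swapping the roles of the two modalities (queries from text, keys and values from the image) gives the other direction.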

Multi-modal Alignment

　　Modal alignment: the enhanced graph feature and textual feature of the source post are transformed into the same feature space:

　　　　$\begin{array}{l}Z_{g}^{i^{\prime}}=W_{g}{ }^{\prime} Z_{g}^{i} \\Z_{t}^{i^{\prime}}=W_{t}^{\prime} Z_{v t}^{i}\end{array}\quad\quad\quad(13)$

　　The alignment loss between the transformed graph feature and the transformed textual feature is then computed with MSE:

　　　　$\mathcal{L}_{\text {align }}=\frac{1}{n} \sum\limits_{i=1}^{n}\left(Z_{g}^{i^{\prime}}-Z_{t}^{i^{\prime}}\right)^{2}\quad\quad\quad(14)$

　　This yields the aligned textual feature $\tilde{Z}_{t}^{i}$ and graph feature $\tilde{Z}_{g}^{i}$, which are used in the multi-modal fusion below.
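　　As a worked illustration of Eq. (14), the snippet below computes the alignment loss for a toy batch. It assumes that the square of a vector difference means the squared Euclidean norm, which the equation leaves implicit; the function name and the numbers are hypothetical.

```python
def alignment_loss(Z_g, Z_t):
    """Mean over posts of the squared distance between the two projected views."""
    n = len(Z_g)
    total = 0.0
    for z_g, z_t in zip(Z_g, Z_t):
        total += sum((g - t) ** 2 for g, t in zip(z_g, z_t))
    return total / n


# Two posts with 2-dimensional projected features (made-up values).
Z_g = [[1.0, 2.0], [0.0, 0.0]]   # graph view, Z_g'
Z_t = [[1.0, 0.0], [1.0, 1.0]]   # text-visual view, Z_t'
loss_align = alignment_loss(Z_g, Z_t)
print(loss_align)  # (0 + 4 + 1 + 1) / 2 = 3.0
```

　　The loss is zero only when the two views of every post coincide, which is what pulls the representations into a shared space.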

Fusing the Above Multi-modal Features

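　　The cross-modal co-attention mechanism described above is applied again to the three modality features $\tilde{Z}_{t}^{i}$, $\tilde{Z}_{g}^{i}$, and $Z_{v}^{i}$ to obtain the cross-modal features $\tilde{Z_{t v}^{i}}$, $\tilde{Z_{v t}^{i}}$, $\tilde{Z_{g t}^{i}}$, $\tilde{Z_{t g}^{i}}$, $\tilde{Z_{g v}^{i}}$, and $\tilde{Z_{v g}^{i}}$. Finally, these cross-modal features are concatenated into the final multi-modal feature: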

　　　　$Z^{i}=\operatorname{concat}\left(\tilde{Z_{t v}^{i}}, \tilde{Z_{v t}^{i}}, \tilde{Z_{g t}^{i}}, \tilde{Z_{t g}^{i}}, \tilde{Z_{g v}^{i}}, \tilde{Z_{v g}^{i}}\right)\quad\quad\quad(15)$

4.4 Classification with Adversarial Training

　　The multi-modal feature $Z^{i}$ of post $p_{i}$ is fed into a fully connected layer to predict whether $p_{i}$ is a rumor:

　　　　$\hat{y}_{i}=\operatorname{softmax}\left(W_{c} Z^{i}+b\right)$

　　where $\hat{y}_{i}$ is the predicted probability that $p_{i}$ is a rumor. The cross-entropy loss is used as the classification loss:

　　　　$\mathcal{L}_{\text {classify }}=-y \log \left(\hat{y}_{i}\right)-(1-y) \log \left(1-\hat{y}_{i}\right)$

　　The overall loss can be written as:

　　　　$\mathcal{L}=\lambda_{c} \mathcal{L}_{\text {classify }}+\lambda_{a} \mathcal{L}_{\text {align }}$

　　where $\lambda_{c}$ and $\lambda_{a}$ balance the two losses.
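　　The classification loss and the weighted total can be checked numerically with a short sketch. The assumptions here are that the classifier outputs two logits, that index 1 is the rumor class, and that the default weights are the values reported under Implementation Details; the function names are illustrative, and adversarial training is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(logits, y):
    """Binary cross-entropy on the predicted rumor probability."""
    y_hat = softmax(logits)[1]               # assumption: index 1 = rumor
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def total_loss(l_classify, l_align, lambda_c=2.15, lambda_a=1.55):
    """Weighted sum of the classification and alignment losses."""
    return lambda_c * l_classify + lambda_a * l_align


l_cls = classification_loss([0.0, 0.0], y=1)   # undecided classifier: log(2)
print(l_cls)
print(total_loss(l_cls, l_align=3.0))
```

　　A confident correct prediction drives the first term toward zero, so the alignment term then dominates the total.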

5 Experiments

Datasets

Weibo: a dataset from the Weibo platform;
PHEME: data from the Twitter platform, covering 5 breaking news events;

Baselines

EANN [Wang et al., 2018] is a GAN-based model exploiting both text and image data. It derives event-invariant features and benefits newly arrived events.
MVAE [Khattar et al., 2019] uses a bimodal variational autoencoder coupled with a binary classifier for multimodal fake news detection.
QSAN [Tian et al., 2020] integrates the quantum-driven text encoding and a novel signed attention mechanism for false information detection.
SAFE [Zhou et al., 2020] jointly exploits multi-modal features and cross-modal similarity to learn the representation of news articles.
EBGCN [Wei et al., 2021] rethinks the reliability of latent relations in the propagation structure by adopting a Bayesian approach.
GLAN [Yuan et al., 2019] jointly encodes the local semantic and global structural information and applies a global-local attention network for rumor detection.

Implementation Details

training : validation : testing = 7 : 1 : 2
Word embeddings are initialized with the word vectors from [Yuan et al., 2019].
$H=8$, i.e. 8 attention heads.
$\lambda_{c} = 2.15$ and $\lambda_{a} = 1.55$.
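　　For reference, a 7 : 1 : 2 split can be produced as in the sketch below. The random shuffle, the fixed seed, and the function name are assumptions; the notes above do not say how the original split was drawn.

```python
import random


def split_dataset(items, ratios=(7, 1, 2), seed=0):
    """Shuffle items and split them into train / validation / test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)       # assumption: random split
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]           # remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 70 10 20
```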

Results

Performance of the Variations

　　"-w/o V", "-w/o G", "-w/o P", and "-w/o A" denote the variants without visual information, social graph information, potential links, and modal alignment, respectively.

　　The results show that:

　　(i) visual modal and graph features are both important for rumor detection;
　　(ii) the modal alignment can facilitate the multi-modal fusion;
　　(iii) considering latent links can significantly improve performance.

6 Conclusions

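　　In this paper, we propose a multi-modal rumor detection framework that incorporates three modalities: text, images, and the social graph. To improve social graph feature learning, the graph topology and the neighborhood aggregation process are enhanced on top of GAT. The framework achieves more effective multi-modal fusion by introducing cross-modal alignment. Evaluations and comparisons on Chinese and English datasets show that the model outperforms state-of-the-art multimedia rumor detection baselines.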
