Paper information

Title: Debunking Rumors on Twitter with Tree Transformer
Authors: Jing Ma, Wei Gao
Venue: COLING 2020

1 Introduction

　　Motivation: Existing conversation-based techniques for rumor detection either strictly follow tree edges or treat all the posts as fully connected during feature learning.

　　Contribution: Propose a novel detection model based on a tree transformer to better utilize user interactions in the dialogue, where post-level self-attention plays the key role in aggregating the intra-/inter-subtree stances.

　　Example: the PLAN model, in which all posts are fully connected to one another.

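　　Conclusion: Models that fully connect all posts suit only shallow settings, not deep ones. A post is generally related only to its parent, so full connectivity amplifies spurious connections between posts.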

2 Tree Transformer Model

　　The overall framework is as follows:

2.1 Token-Level Tweet Representation

　　Transformer encoder architecture:

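　　Given a tweet represented as a word sequence $x_{i}=\left(w_{1} \cdots w_{t} \cdots w_{\left|x_{i}\right|}\right)$, each $w_{t} \in \mathbb{R}^{d}$ is a $d$-dimensional vector that can be initialized with pre-trained word embeddings. We use a multi-head self-attention network (MH-SAN) to map each $w_{t}$ to a fixed-size hidden vector. The core idea of MH-SAN is to jointly attend to words from different representation subspaces at different positions. More specifically, MH-SAN first transforms the input word sequence $x_i$ into multiple subspaces with different linear projections: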

　　　　$Q_{i}^{h}, K_{i}^{h}, V_{i}^{h}=x_{i} \cdot W_{Q}^{h}, \quad x_{i} \cdot W_{K}^{h}, \quad x_{i} \cdot W_{V}^{h} \quad\quad\quad(1)$

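　　where $\left\{Q_{i}^{h}, K_{i}^{h}, V_{i}^{h}\right\}$ are the query, key, and value representations, respectively, and $\left\{W_{Q}^{h}, W_{K}^{h}, W_{V}^{h}\right\}$ are the parameter matrices associated with the $h$-th head. An attention function is then applied to generate the output states: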

　　　　$O_{i}^{h}=\operatorname{softmax}\left(\frac{Q_{i}^{h} \cdot K_{i}^{h^{\top}}}{\sqrt{d_{h}}}\right) \cdot V_{i}^{h} \quad\quad\quad(2)$

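　　where $\sqrt{d_{h}}$ is the scaling factor and $d_{h}$ is the dimensionality of the subspace of the $h$-th head. Finally, the output representation is the concatenation of all heads, $O_{i}=\left[O_{i}^{1}, O_{i}^{2}, \cdots, O_{i}^{n}\right] \in \mathbb{R}^{\left|x_{i}\right| \times d}$, where $n$ is the number of heads, followed by a normalization layer (layerNorm) and a feed-forward network (FFN):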
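
　　Eq. (1) and Eq. (2) can be traced on a toy input. The following is only a minimal pure-Python sketch, not the authors' implementation: the helper names (`matmul`, `attention_head`, `multi_head`), the token vectors, and the projection matrices are all made up for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, w_q, w_k, w_v):
    """Eq. (1) projections followed by Eq. (2) scaled dot-product attention."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_h = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / math.sqrt(d_h) for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v), weights

def multi_head(x, heads):
    """Concatenate the outputs of all heads along the feature axis."""
    outputs = [attention_head(x, *head)[0] for head in heads]
    return [[value for out in outputs for value in out[t]] for t in range(len(x))]

# Toy input (hypothetical values): 3 tokens, d = 4, n = 2 heads with d_h = 2.
x = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
w_first = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
w_second = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(w_first, w_first, w_first), (w_second, w_second, w_second)]
O = multi_head(x, heads)  # 3 rows (tokens), each of width n * d_h = 4
```

　　Each head works in its own $d_h$-dimensional subspace, and concatenating the $n$ heads restores the width $d = n \cdot d_h$, which matches the shape $\mathbb{R}^{\left|x_{i}\right| \times d}$ stated for $O_i$.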

　　　　$\begin{array}{l}B_{i}=\operatorname{layerNorm}\left(O_{i} \cdot W_{B}+O_{i}\right) \\H_{i}=\operatorname{FFN}\left(B_{i} \cdot W_{S}+B_{i}\right)\end{array} \quad\quad\quad(3)$

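　　where $H_{i}=\left[h_{1} ; \ldots ; h_{\left|x_{i}\right|}\right] \in \mathbb{R}^{\left|x_{i}\right| \times d}$ is the matrix representing all the words in tweet $x_i$, and $W_{B}$ and $W_{S}$ contain the weights of the transformations. Finally, we obtain the representation of $x_i$ by max-pooling over the vectors of all its words: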

　　　　$s_{i}=\operatorname{max-pooling}\left(h_{1}, \ldots, h_{\left|x_{i}\right|}\right) \quad\quad\quad(4)$

　　where $s_{i} \in \mathbb{R}^{1 \times d}$ is a $d$-dimensional vector and $|\cdot|$ denotes the number of words.
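
　　The normalization in Eq. (3) and the pooling in Eq. (4) are small enough to write out. This is a sketch under assumptions: `layer_norm` omits the learned gain and bias that a real layerNorm usually has, and the matrix `H` is made-up data.

```python
import math

def layer_norm(vector, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]

def max_pool(hidden):
    """Eq. (4): element-wise maximum over the token vectors h_1 ... h_|x_i|."""
    return [max(column) for column in zip(*hidden)]

# Hypothetical hidden states for a 2-token tweet with d = 3.
H = [[0.2, -1.0, 0.5], [0.9, 0.1, -0.3]]
s = max_pool(H)  # [0.9, 0.1, 0.5]
```

　　Max-pooling keeps, for every dimension, the strongest activation among the tweet's words, so $s_i$ has the same width $d$ regardless of the tweet length.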

2.2 Post-Level Tweet Representation

　　Why cross-check all the posts in the same subtree to enhance representation learning:

　　(1) posts are generally short, so the stance expressed in each node is closely correlated with its responsive context;

　　(2) posts in the same subtree are directed at the individual opinion expressed in the root of the subtree;

　　(3) coherent opinions can be captured by comparing all responsive posts in the same subtree, which down-weights incorrect information.

Bottom-Up Transformer

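　　Figure 2(c) illustrates the tree transformer structure of the paper, which cross-checks posts from the bottom subtrees up to the upper subtrees. Specifically, given a subtree rooted at $x_j$, let $\mathcal{V}(j)=\left\{x_{j}, \ldots, x_{k}\right\}$ denote the set of nodes in the subtree, i.e., $x_j$ and its direct responding nodes. We then apply a post-level subtree attention (a transformer-based block as shown in Figure 2(b)) over $\mathcal{V}(j)$ to obtain a refined representation of each node in $\mathcal{V}(j)$: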

　　　　$\left[s_{j}^{\prime} ; \ldots ; s_{k}^{\prime}\right]=\operatorname{TRANS}\left(\left[s_{j} ; \ldots ; s_{k}\right], \Theta_{T}\right) \quad\quad\quad(5)$

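　　where $TRANS (\cdot)$ is a transform function with a form similar to Eq. 2-4 and $\Theta_{T}$ contains the transformer parameters. Thus $s_{*}^{\prime}$ is the refined representation of $s_{*}$ based on the context of the subtree. Note that every node can be viewed as either a parent or a child in different subtrees; for example, in Figure 2(a), $x_{2}$ is the parent node in $T\left(x_{2}\right)$ and a child node in $T(r)$. As a result, some nodes go through two levels of refinement in our bottom-subtree-to-upper-subtree model: (1) capturing the stance by comparison with the parent node, and (2) down-weighting inaccurate information by attending to neighboring nodes. For example, the representation of a parent that supports a false claim may be refined if most of its responses refute it.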
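
　　The bottom-up order of refinement can be sketched as a post-order traversal. The code below is an illustration under assumptions, not the paper's code: `self_attend` is a parameter-free dot-product self-attention standing in for $TRANS(\cdot)$ (the real block has learned parameters $\Theta_{T}$), and the tree and vectors are hypothetical.

```python
import math

def self_attend(vectors):
    """Stand-in for TRANS: each vector becomes an attention-weighted mix of all vectors."""
    d = len(vectors[0])
    refined = []
    for query in vectors:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in vectors]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        refined.append([sum(e / total * v[i] for e, v in zip(exps, vectors)) for i in range(d)])
    return refined

def bottom_up(children, node, reps, trans=self_attend):
    """Refine reps in place, deepest subtrees first; return the subtree roots in visit order."""
    kids = children.get(node, [])
    if not kids:
        return []  # a leaf roots no subtree of its own
    visited = []
    for kid in kids:
        visited += bottom_up(children, kid, reps, trans)
    members = [node] + kids  # V(j): the subtree root and its direct responses
    for member, vec in zip(members, trans([reps[m] for m in members])):
        reps[member] = vec
    return visited + [node]

# Hypothetical tree: r has replies x1 and x2; x2 has replies x3 and x4.
children = {"r": ["x1", "x2"], "x2": ["x3", "x4"]}
reps = {"r": [1.0, 0.0], "x1": [0.0, 1.0], "x2": [1.0, 1.0],
        "x3": [0.0, 0.0], "x4": [0.5, 0.5]}
order = bottom_up(children, "r", reps)  # ["x2", "r"]
```

　　In this toy tree $x_2$ is refined twice, once as the root of its own subtree and once as a child in the subtree of $r$, which is the two-level refinement described above.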

Top-Down Transformer

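　　The top-down transformer runs in the opposite direction to the bottom-up transformer, following the direction of information propagation; its architecture is shown in Figure 2(d). Likewise, the representations it learns are enhanced by capturing stances and self-correcting contextual information.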
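
　　One way to picture the top-down direction is to list the subtree node sets $\mathcal{V}(j)$ in the order they would be attended over, starting from the root. The breadth-first order used here is an assumption made for illustration, and the function name and toy tree are hypothetical.

```python
from collections import deque

def top_down_subtrees(children, root):
    """List each subtree's node set V(j), starting at the root and moving down."""
    groups = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        kids = children.get(node, [])
        if kids:
            groups.append([node] + kids)  # the subtree root and its direct responses
            queue.extend(kids)
    return groups

# Hypothetical tree: r has replies x1 and x2; x2 has replies x3 and x4.
children = {"r": ["x1", "x2"], "x2": ["x3", "x4"]}
groups = top_down_subtrees(children, "r")
# [["r", "x1", "x2"], ["x2", "x3", "x4"]]
```

　　Applying the subtree attention to these groups in list order refines the root's subtree first, so information flows from the source post toward the leaves.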

2.3 The overall Model

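　　To jointly capture the opinions expressed across the whole tree, we use an attention layer, operating on the refined node representations, to select the important posts that carry accurate information. This yields: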

　　　　$\begin{array}{l}\alpha_{i}=\frac{\exp \left(s_{i}^{\prime} \cdot \mu^{\top}\right)}{\sum\limits_{j} \exp \left(s_{j}^{\prime} \cdot \mu^{\top}\right)} \\\tilde{s}=\sum\limits_{i} \alpha_{i} \cdot s_{i}^{\prime}\end{array}\quad\quad\quad(6)$

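　　where $s_{i}^{\prime}$ is obtained from either the Bottom-Up Transformer or the Top-Down Transformer, and $\mu \in \mathbb{R}^{1 \times d}$ is the parameter of the attention mechanism. Here $\alpha_{i}$ is the attention weight of node $x_i$, used to produce the representation $\tilde{s}$ of the whole tree. Finally, we use a fully connected output layer to predict the probability distribution over the rumor classes: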

　　　　$\hat{y}=\operatorname{softmax}\left(V_{o} \cdot \tilde{s}+b_{o}\right) \quad\quad\quad(7)$

　　where $V_{o}$ and $b_{o}$ are the weights and bias of the output layer.
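
　　Eq. (6) and Eq. (7) amount to an attention-weighted sum followed by a softmax classifier. The sketch below uses made-up numbers throughout; the function names, the choice of four output classes, and all parameter values are assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def pool_tree(refined, mu):
    """Eq. (6): attention weights alpha_i, then the weighted sum over refined nodes."""
    alphas = softmax([dot(s, mu) for s in refined])
    pooled = [sum(a * s[i] for a, s in zip(alphas, refined)) for i in range(len(mu))]
    return pooled, alphas

def predict(pooled, v_o, b_o):
    """Eq. (7): fully connected output layer followed by softmax."""
    return softmax([dot(row, pooled) + bias for row, bias in zip(v_o, b_o)])

# Hypothetical refined node vectors s'_i with d = 2.
refined = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mu = [0.0, 0.0]  # an all-zero mu gives every node the same weight
pooled, alphas = pool_tree(refined, mu)
v_o = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]  # 4 classes, made up
b_o = [0.0, 0.0, 0.0, 0.0]
y_hat = predict(pooled, v_o, b_o)
```

　　With a trained $\mu$ the weights $\alpha_i$ would no longer be uniform, and posts whose refined vectors align with $\mu$ would dominate $\tilde{s}$.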

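　　In addition, a straightforward alternative is to concatenate the tree representations from the Bottom-Up transformer and the Top-Down transformer to obtain a richer tree representation, which is then fed into the $softmax (\cdot)$ function above for rumor prediction.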

　　All our models are trained to minimize the squared error between the predicted probability distribution and the ground-truth distribution:

　　　　$L(y, \hat{y})=\sum_{n=1}^{N} \sum_{c=1}^{C}\left(y_{c}-\hat{y}_{c}\right)^{2}+\lambda\|\Theta\|_{2}^{2} \quad\quad\quad(8)$

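　　where $y_{c}$ is the ground-truth label and $\hat{y}_{c}$ is the predicted probability of class $c$, $N$ is the number of training trees, $C$ is the number of classes, $\|.\|_{2}$ is the $L_{2}$ regularization term over all model parameters $\Theta$, and $\lambda$ is the trade-off coefficient.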
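
　　As a quick check of Eq. (8), the loss can be computed by hand on a tiny batch. This is a minimal sketch with made-up labels, predictions, and parameters; `params` is a flat list standing in for all entries of $\Theta$.

```python
def squared_error_loss(y_true, y_pred, params, lam):
    """Eq. (8): squared error summed over trees and classes, plus an L2 penalty."""
    data_term = sum((t - p) ** 2
                    for truth, pred in zip(y_true, y_pred)
                    for t, p in zip(truth, pred))
    penalty = lam * sum(w * w for w in params)
    return data_term + penalty

# Two hypothetical trees (N = 2) and two classes (C = 2).
y_true = [[1.0, 0.0], [0.0, 1.0]]
y_pred = [[0.8, 0.2], [0.4, 0.6]]
loss = squared_error_loss(y_true, y_pred, params=[1.0, -2.0], lam=0.1)
# data term 0.04 + 0.04 + 0.16 + 0.16 = 0.4, penalty 0.1 * (1 + 4) = 0.5
```

　　The two terms add up to a loss of about 0.9 here; a larger $\lambda$ shifts the balance toward small parameter values.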

3 Experiments

Datasets

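　　Experiments use the TWITTER and PHEME datasets. Each is split by propagation-tree depth into TWITTER-S (PHEME-S) and TWITTER-D (PHEME-D), giving four datasets in total. The table below shows the dataset statistics: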

Experiment

Early Rumor Detection Performance