15_VLDB_Scalable topical phrase mining from text

2020-08-12 17:25:45

While most topic modeling algorithms model text corpora with unigrams, human interpretation often relies on inherent grouping of terms into phrases. As such, we consider the problem of discovering topical phrases of mixed lengths.
Existing work either applies post-processing to the results of unigram-based topic models, or utilizes complex n-gram-discovery topic models. These methods generally produce low-quality topical phrases or suffer from poor scalability on even moderately-sized datasets. We propose a different approach that is both computationally efficient and effective.
Our solution combines a novel phrase mining framework to segment a document into single and multi-word phrases, and a new topic model that operates on the induced document partition. Our approach discovers high quality topical phrases with negligible extra cost to the bag-of-words topic model in a variety of datasets including research publication titles, abstracts, reviews, and news articles.

In recent years, topic modeling has become a popular method for discovering the abstract ‘topics’ that underlie a collection of documents. A topic is typically modeled as a multinomial distribution over terms, and frequent terms related by a common theme are expected to have a large probability in a topic multinomial. When latent topic multinomials are inferred, it is of interest to visualize these topics in order to facilitate human interpretation and exploration of the large amounts of unorganized text often found within text corpora. In addition, visualization provides a qualitative method of validating the inferred topic model [4]. A list of most probable unigrams is often used to describe individual topics, yet these unigrams often provide a hard-to-interpret or ambiguous representation of the topic. Augmenting unigrams with a list of probable phrases provides a more intuitively understandable and accurate description of a topic. This can be seen in the term/phrase visualization of an information retrieval topic in Table 1.

While topic models have clear application in facilitating understanding, organization, and exploration in large text collections such as those found in full-text databases, difficulty in interpretation and scalability issues have hindered adoption. Several attempts have been made to address the prevalent deficiency in visualizing topics using unigrams.
These methods generally attempt to infer phrases and topics simultaneously by creating complex generative mechanisms. The resultant models can directly output phrases and their latent topic assignments. Two such methods are Topical N-Gram and PD-LDA [27, 16]. While it is appealing to incorporate the phrase-finding element into the topical clustering process, these methods often suffer from high complexity and overall demonstrate poor scalability beyond small datasets.

Some other methods apply a post-processing step to unigram-based topic models [2, 6]. These methods assume that all words in a phrase will be assigned to a common topic, which, however, is not guaranteed by the topic model.

We propose a new methodology, ToPMine, that demonstrates both scalability and interpretability compared to other topical phrase methods. Because language exhibits the principle of non-compositionality, in which a phrase’s meaning is not derivable from its constituent words, phrases are decomposed under the ‘bag-of-words’ assumption and their meaning may be lost [25]. Our insight is that phrases need to be systematically assigned to topics. This insight motivates our partitioning of a document into phrases, then using these phrases as constraints to ensure all words are systematically placed in the same topic.

We perform topic modeling on phrases by first mining phrases, segmenting each document into single and multi-word phrases, and then using the constraints from segmentation in our topic modeling. First, to address the scalability issue, we develop an efficient phrase mining technique to extract frequent significant phrases and segment the text simultaneously. It uses frequent phrase mining and a statistical significance measure to segment the text while simultaneously filtering out false candidate phrases. Second, to ensure a systematic method of assigning latent topics to phrases, we propose a simple but effective topic model. By restricting all constituent terms within a phrase to share the same latent topic, we can assign a phrase the topic of its constituent words.

Example 1. By frequent phrase mining and context-specific statistical significance ranking, the following titles can be segmented as follows:
Title 1. [Mining frequent patterns] without candidate generation: a [frequent pattern] tree approach.
Title 2. [Frequent pattern mining] : current status and future directions.
The tokens grouped together by [] are constrained to share the same topic assignment.

Our ToPMine method has the following advantages.

  • Our phrase mining algorithm efficiently extracts candidate phrases and the aggregate statistics needed to prune them. Requiring no domain knowledge or specific linguistic rulesets, our method is purely data-driven.
  • Our method allows for an efficient and accurate filtering of false-candidate phrases. In title 1 of Example 1, after merging ‘frequent’ and ‘pattern’, we only need to test whether ‘frequent pattern tree’ is a significant phrase in order to determine whether to keep ‘frequent pattern’ as a phrase in this title.
  • Segmentation induces a ‘bag-of-phrases’ representation for documents. We incorporate this as a constraint into our topic model, eliminating the need for additional latent variables to find the phrases. The model complexity is reduced and the conformity of topic assignments within each phrase is maintained.

We state and analyze the problem in Section 2, followed by our proposed solution in Section 3. We present the key components of our solution, phrase mining and phraseconstrained topic modeling in Sections 4 and 5. In Section 6, we review the related work. Then we evaluate the proposed solution in Section 7, and conclude in Section 9.

2. PROBLEM DEFINITION

The input is a corpus of D documents, where the d-th document is a sequence of N_d tokens: w_{d,i}, i = 1, . . . , N_d. Let N = Σ_{d=1}^{D} N_d. For convenience we index all the unique words in this corpus using a vocabulary of V words, and w_{d,i} = x, x ∈ {1, . . . , V}, means that the i-th token in the d-th document is the x-th word in the vocabulary. Throughout this paper we use ‘word x’ to refer to the x-th word in the vocabulary.

Given a corpus and a number of topics as a parameter, our goal is to infer the corpus’ underlying topics and visualize these topics in a human-interpretable representation using topical phrases. Statistically, a topic k is characterized by a probability distribution φ_k over words. φ_{k,x} = p(x | k) ∈ [0, 1] is the probability of seeing the word x in topic k, and Σ_{x=1}^{V} φ_{k,x} = 1. For example, in a topic about the database research area, the probability of seeing ‘database’, ‘system’, and ‘query’ is high, and the probability of seeing ‘speech’, ‘handwriting’, and ‘animation’ is low. This characterization is advantageous in statistical modeling of text, but is weak in human interpretability. Unigrams may be ambiguous, especially across specific topics. For example, the word ‘model’ can mean different things depending on the topic: a model could be a member of the fashion industry or perhaps be part of a phrase such as ‘topic model’. Using phrases helps avoid this ambiguity.

Definition 1. We formally define phrases and other necessary notation and terminology as follows:

  • A phrase is a sequence of contiguous tokens: P = {w_{d,i}, . . . , w_{d,i+n}}, n > 0.
  • A partition over the d-th document is a sequence of phrases: (P_{d,1}, . . . , P_{d,G_d}), G_d ≥ 1, such that the concatenation of the phrase instances is the original document.

In Example 1, we can see the importance of word proximity in phrase recognition. As such, we place a contiguity restriction on our phrases.
To illustrate an induced partition upon a text segment, note how the concatenation of all single and multi-word phrases in Title 1 yields an ordered sequence of tokens representing the original title.
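As a small illustration (not from the paper), a partition can be represented as a list of token tuples whose concatenation recovers the document’s token sequence; the phrase boundaries below follow Title 1 of Example 1, with punctuation omitted.

```python
# 'Bag-of-phrases' partition of Title 1 from Example 1 (punctuation omitted).
title_tokens = ["mining", "frequent", "patterns", "without", "candidate",
                "generation", "a", "frequent", "pattern", "tree", "approach"]

partition = [("mining", "frequent", "patterns"),
             ("without",), ("candidate",), ("generation",), ("a",),
             ("frequent", "pattern"), ("tree",), ("approach",)]

# Partition requirement: concatenating the phrase instances yields the original document.
assert [w for phrase in partition for w in phrase] == title_tokens
```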

2.1 Desired Properties

We outline the desired properties of a topical phrase mining algorithm as follows:

  • The lists of phrases demonstrate a coherent topic.
  • The phrases extracted are valid and human-interpretable.
  • Each phrase is assigned a topic in a principled manner.
  • The overall method is computationally efficient and of comparable complexity to LDA.
  • The topic model demonstrates similar perplexity to LDA.

In addition to the above requirements for the system as a whole, we specify the requirements of a topic-representative phrase. When designing our phrase mining framework, we ensure that our phrase-mining and phrase-construction algorithms naturally validate candidate phrases on three qualities that constitute human-interpretability.

  1. Frequency: The most important quality when judging whether a phrase conveys important information about a topic is its frequency of use within that topic. A phrase that is not frequent within a topic is likely not important to the topic. This formulation can be interpreted as generalizing the list of most probable unigrams used to visualize an LDA topic to a list of the most probable phrases one will encounter in a given topic.
  2. Collocation: In corpus linguistics, a collocation refers to the co-occurrence of tokens with a frequency significantly higher than what would be expected by chance. A commonly used example of a phraseological collocation is the pair of candidate collocations "strong tea" and "powerful tea" [9]. One might assume that the two phrases appear with similar frequency, yet in the English language the phrase "strong tea" is considered more correct and appears with much higher frequency. Because a collocation’s frequency deviates from what is expected, we consider collocations ‘interesting’ and informative. This insight motivates the necessity of analyzing our phrases probabilistically to ensure they are collocations.
  3. Completeness: If long frequent phrases satisfy the above criteria, then their subsets also satisfy these criteria. For example, in the case of "mining frequent patterns", "mining frequent" will satisfy the frequency and collocation restrictions, yet is clearly a subset of a larger and more intuitive phrase. Our phrase-construction algorithm should be able to automatically determine the most appropriate size for a human-interpretable phrase.

We will introduce a framework that naturally embeds these phrase requirements.

3. TOPMINE FRAMEWORK

To extract topical phrases that satisfy our desired requirements, we propose a framework that can be divided into two main parts: phrase-mining with text segmentation and phrase-constrained topic modeling. Our process for transforming a ‘bag-of-words’ document to a high-quality ‘bag-of-phrases’ involves first mining frequent phrases, and then using these phrases to segment each document through an agglomerative phrase construction algorithm. After inducing a partition on each document, we perform topic modeling to associate the same topic to each word in a phrase and thus naturally to the phrase as a whole.
The goal of our phrase mining is to collect aggregate statistics for our phrase-construction. The statistical significance measure uses these aggregates to guide segmentation of each document into phrases. This methodology leverages phrase context and phrase significance in the construction process ensuring all phrases are of high-quality. The resultant partition is the input for our phrase-constrained topic model.
We choose the ‘bag-of-phrases’ input over the traditional ‘bag-of-words’ because under the latter assumption, tokens in the same phrase can be assigned to different latent topics.
We address this by proposing a topic model, PhraseLDA, which incorporates the ‘bag-of-phrases’ partition from our phrase mining algorithm as constraints in the topical inference process. We have derived a collapsed Gibbs sampling method that, when performing inference, ensures that tokens in the same phrase are assigned to the same topic. We expound upon our phrase-mining algorithm and topic model in Sections 4 and 5, respectively.
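To make the phrase constraint concrete, the following is a minimal Python sketch, not the paper’s exact PhraseLDA derivation: an entire phrase is resampled as one unit in a collapsed-Gibbs-style step, so all of its tokens necessarily receive the same topic. The count matrices `n_dk`, `n_kw`, `n_k`, the hyperparameters `alpha` and `beta`, and the per-token product in the conditional are standard LDA bookkeeping assumed for illustration.

```python
import numpy as np

def resample_phrase_topic(d, phrase_tokens, old_k, n_dk, n_kw, n_k,
                          alpha, beta, rng):
    """Resample one topic for an entire phrase; all its tokens share it.

    Simplified collapsed-Gibbs-style step: remove the phrase's tokens from
    the count tables, draw a single topic for the whole phrase, then add
    the counts back under that topic.
    """
    K, V = n_kw.shape
    n_dk[d, old_k] -= len(phrase_tokens)
    n_k[old_k] -= len(phrase_tokens)
    for w in phrase_tokens:
        n_kw[old_k, w] -= 1

    # unnormalized conditional: document-topic term times a word term per token
    probs = np.empty(K)
    for k in range(K):
        p = n_dk[d, k] + alpha
        for w in phrase_tokens:
            p *= (n_kw[k, w] + beta) / (n_k[k] + V * beta)
        probs[k] = p
    probs /= probs.sum()
    new_k = rng.choice(K, p=probs)   # rng = np.random.default_rng()

    n_dk[d, new_k] += len(phrase_tokens)
    n_k[new_k] += len(phrase_tokens)
    for w in phrase_tokens:
        n_kw[new_k, w] += 1
    return new_k
```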

4. PHRASE MINING

We present a phrase-mining algorithm that, given a corpus of documents, merges the tokens within each document into human-interpretable phrases. Our method is purely data-driven, allowing for strong cross-domain performance, and can operate on a variety of datasets. We extract high-quality phrases by obtaining counts of frequent contiguous patterns, then probabilistically reasoning about these patterns while applying context constraints to discover meaningful phrases.
The phrase mining algorithm can be broken down into two major steps. First, we mine the corpus for frequent candidate phrases and their aggregate counts. We have developed a technique that can quickly collect this information without traversing the prohibitively large search space. Second, we agglomeratively merge words in each document into quality phrases as guided by our significance measure. We will discuss these steps in greater detail in the next two subsections.

4.1 Frequent Phrase Mining

In Algorithm 1, we present our frequent phrase mining algorithm. The task of frequent phrase mining can be defined as collecting aggregate counts for all contiguous words in a corpus that satisfy a certain minimum support threshold.
We draw upon two properties for efficiently mining these frequent phrases.

  1. Downward closure lemma: If phrase P is not frequent, then any super-phrase of P is guaranteed to be not frequent.

  2. Data-antimonotonicity: If a document contains no frequent phrases of length n, the document does not contain frequent phrases of length > n.

The downward closure lemma was first introduced for mining general frequent patterns using the Apriori algorithm [1]. We can exploit this property for our case of phrases by maintaining a set of active indices. These active indices are a list of positions in a document at which a contiguous pattern of length n is frequent. In line 1 of Algorithm 1, we see the list of active indices.

In addition, we use the data-antimonotonicity property to assess whether a document should be considered for further mining [10]. If the document under consideration has been deemed to contain no more phrases of a certain length, then it is guaranteed to contain no phrases of greater length, and we can safely remove it from further consideration. These two pruning techniques work well with the natural sparsity of phrases and provide early termination of our algorithm without searching through the prohibitively large candidate phrase space.

We take an increasing-size sliding window over the corpus to generate candidate phrases and obtain aggregate counts. At iteration k, for each document still in consideration, fixed-length candidate phrases beginning at each active index are counted using an appropriate hash-based counter. As seen in Algorithm 1 line 7, candidate phrases of length k-1 are pruned if they do not satisfy the minimum support threshold, and their starting positions are removed from the active indices. We refer to this implementation of the downward closure lemma as position-based Apriori pruning. As seen in Algorithm 1 lines 9-11, when a document contains no more active indices, it is removed from any further consideration. In addition to pruning the search for frequent phrases, this second condition provides a natural termination criterion for our algorithm.

The frequency criterion requires phrases to have sufficient occurrences. In general, we can set a minimum support that grows linearly with corpus size. The larger the minimum support, the more precision and the less recall is expected.
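The following is a minimal Python sketch of this procedure (an illustration under the description above, not the paper’s Algorithm 1): documents are lists of tokens, `min_support` is the frequency threshold, and the active-index bookkeeping plus the data-antimonotonicity check prune the search as each phrase length is considered.

```python
from collections import Counter

def mine_frequent_phrases(docs, min_support):
    """Count contiguous phrases (as token tuples) meeting min_support."""
    counts = Counter()
    for doc in docs:                       # length-1 pass
        for w in doc:
            counts[(w,)] += 1

    # active indices: positions whose current-length phrase is still frequent
    active = {}
    for d, doc in enumerate(docs):
        idxs = [i for i, w in enumerate(doc) if counts[(w,)] >= min_support]
        if idxs:
            active[d] = idxs

    n = 2
    while active:                          # grow phrase length until nothing survives
        layer, positions = Counter(), {}
        for d, idxs in active.items():
            doc = docs[d]
            positions[d] = [i for i in idxs if i + n <= len(doc)]
            for i in positions[d]:
                layer[tuple(doc[i:i + n])] += 1

        next_active = {}
        for d, idxs in positions.items():
            # position-based Apriori pruning: keep starts of frequent length-n phrases
            keep = [i for i in idxs if layer[tuple(docs[d][i:i + n])] >= min_support]
            if keep:                       # data-antimonotonicity: otherwise drop the document
                next_active[d] = keep

        counts.update({p: c for p, c in layer.items() if c >= min_support})
        active, n = next_active, n + 1

    return {p: c for p, c in counts.items() if c >= min_support}
```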

General frequent transaction pattern mining searches an exponential number of candidate patterns [1, 11]. When mining phrases, our contiguity requirement significantly reduces the number of candidate phrases generated. Worst-case time complexity occurs when the entire document under consideration meets the minimum support threshold.
In this scenario, for a document d we generate O(N_d^2) (a quadratic number of) candidate phrases. Although this quadratic time and space complexity seems prohibitive, several properties can be used to ensure better performance. First, separating each document into smaller segments by splitting on phrase-invariant punctuation (commas, periods, semicolons, etc.) allows us to consider constant-size chunks of text at a time. This effectively makes the overall complexity of our phrase mining algorithm linear, O(N), in relation to corpus size. The downward closure and data-antimonotonicity pruning mechanisms serve to further reduce runtime.
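As a small illustration of the chunking idea, each document can be split on phrase-invariant punctuation before mining; the punctuation set below is an assumption for the sketch, not taken from the paper.

```python
import re

PHRASE_INVARIANT = r"[,.;:!?()]"   # assumed set of phrase-invariant punctuation

def chunk(text):
    """Split a document into chunks that phrases never cross."""
    return [piece.split() for piece in re.split(PHRASE_INVARIANT, text) if piece.strip()]

chunk("Mining frequent patterns without candidate generation: a frequent pattern tree approach.")
# -> [['Mining', 'frequent', 'patterns', 'without', 'candidate', 'generation'],
#     ['a', 'frequent', 'pattern', 'tree', 'approach']]
```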

4.2 Segmentation and Phrase Filtering

Traditional phrase extraction methods filter low-quality phrases by applying a heuristic "importance" ranking that reflects confidence in candidate key phrases, then keeping only the top-ranked phrases [21]. Some methods employ external knowledge bases or NLP constraints to filter out phrases [17, 21].
Our candidate phrase filtering step differentiates itself from traditional phrase extraction methods by implicitly filtering phrases in our document segmentation step. By returning to the context and constructing our phrases from the bottom-up, we can use phrase-context and the partition constraints to determine which phrase-instance was most likely intended. Because a document can contain at most a linear number of phrases (the number of terms in the document) and our frequent phrase mining algorithm may generate up to a quadratic number of candidate phrases, a quadratic number of bad candidate phrases can be eliminated by enforcing the partition constraint.

The key element of this step is our bottom-up merging process. At each iteration, our algorithm makes locally optimal decisions in merging single and multi-word phrases as guided by a statistical significance score. In the next subsection, we present an agglomerative phrase-construction algorithm then explain how the significance of a potential merging is evaluated and how this significance guides our agglomerative merging algorithm.

4.2.1 Phrase Construction Algorithm

The main novelty in our phrase mining algorithm is the way we construct our high-quality phrases by inducing a partition upon each document. We employ a bottom-up agglomerative merging that greedily merges the best possible pair of candidate phrases at each iteration. This merging constructs phrases from single and multi-word phrases while maintaining the partition requirement. Because only phrases induced by the partition are valid phrases, we have implicitly filtered out phrases that may have passed the minimum support criterion by random chance.
In Algorithm 2, we present the phrase construction algorithm. The algorithm takes as input a document and the aggregate counts obtained from the frequent phrase mining algorithm. It then iteratively merges phrase instances with the strongest association as guided by a potential merging’s significance measure. The process is a bottom-up approach that upon termination induces a partition upon the original document creating a ‘bag-of-phrases’.

Figure 1 tracks the phrase construction algorithm by visualizing the agglomerative merging of phrases at each iteration with a dendrogram. Operating on a paper title obtained from our dblp titles dataset, each level of the dendrogram represents a single merging. At each iteration, our algorithm selects two contiguous phrases such that their merging is of highest significance (Algorithm 2, line 4) and merges them (Algorithm 2, lines 6-9). The following iteration then considers the newly merged phrase as a single unit. By considering each newly merged phrase as a single unit and assessing the significance of merging two phrases at each iteration, we successfully address the ‘free-rider’ problem, in which long, unintelligible phrases are evaluated as significant when the occurrence of a phrase is compared to the occurrence of each constituent term independently.
As all merged phrases are frequent phrases, we have fast access to the aggregate counts necessary to calculate the significance values for each potential merging. By using proper data structures, the contiguous pair with the highest significance can be selected and merged in logarithmic time, O(log N_d), for each document. This complexity can once again be reduced by segmenting each document into smaller chunks by splitting on phrase-invariant punctuation. Our algorithm terminates when the next merging with the highest significance does not meet a predetermined significance threshold α or when all the terms have been merged into a single phrase. This is represented by the dashed line in Figure 1, where there are no more candidate phrases that meet the significance threshold. Upon termination, a natural ‘bag-of-phrases’ partition remains. While the frequent phrase mining algorithm satisfies the frequency requirement, the phrase construction algorithm satisfies the collocation and completeness criteria.
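A hedged Python sketch of this merging loop follows. Here `significance` stands in for the statistical score developed below and `alpha` is the termination threshold; for brevity the sketch rescans adjacent pairs on every iteration instead of maintaining the heap-style structure that yields the O(log N_d) selection described above.

```python
def construct_phrases(tokens, counts, significance, alpha):
    """Greedy bottom-up phrase construction (illustrative sketch).

    tokens: token list for one document (or chunk)
    counts: aggregate phrase counts from frequent phrase mining
    significance(counts, p1, p2): score for merging two adjacent phrases
    alpha: minimum significance required to merge
    """
    phrases = [(w,) for w in tokens]               # start from unigrams
    while len(phrases) > 1:
        best_i, best_score = None, alpha
        for i in range(len(phrases) - 1):
            merged = phrases[i] + phrases[i + 1]
            if merged in counts:                    # only frequent phrases are candidates
                score = significance(counts, phrases[i], phrases[i + 1])
                if score > best_score:
                    best_i, best_score = i, score
        if best_i is None:                          # no merge meets the threshold: terminate
            break
        phrases[best_i:best_i + 2] = [phrases[best_i] + phrases[best_i + 1]]
    return phrases                                  # the induced 'bag of phrases'
```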
To statistically reason about the occurrence of phrases, we consider a null hypothesis that the corpus is generated from a series of independent Bernoulli trials. Under this hypothesis, the presence or absence of a phrase at a specific position in the corpus is the outcome of a Bernoulli random variable, and the expected number of occurrences of a phrase can be interpreted as a binomial random variable. Because the number of tokens L in the corpus can be assumed to be fairly large, this binomial can be reasonably approximated by a normal distribution. As such, the null hypothesis distribution, h_0, for the random variable f(P), the count of a phrase P within the corpus, is:

$$ h_0\big(f(P)\big): \quad f(P) \sim \mathcal{N}\big(L\,p(P),\; L\,p(P)(1 - p(P))\big) $$

where p(P) is the Bernoulli trial success probability for phrase P. The empirical probability of a phrase in the corpus can be estimated as p(P) = f(P)/L. Consider a longer phrase composed of two phrases P1 and P2. The mean of its frequency under our null hypothesis of independence of the two phrases is:

$$ \mu_0\big(f(P_1 P_2)\big) = L\,p(P_1)\,p(P_2) $$
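A small illustrative calculation (numbers invented, not from the paper) of the quantities above: the empirical phrase probability p(P) = f(P)/L, the expected count of the merged phrase under the independence null hypothesis, and its spread under the normal approximation of h_0.

```python
from math import sqrt

L = 1_000_000                               # corpus size in tokens (assumed)
f_p1, f_p2, f_merged = 5_000, 4_000, 1_200  # counts for P1, P2, and the merged phrase (assumed)

p1, p2 = f_p1 / L, f_p2 / L                 # empirical p(P) = f(P) / L
mu = L * p1 * p2                            # expected merged-phrase count under independence
sigma = sqrt(L * p1 * p2 * (1 - p1 * p2))   # spread under the normal approximation

print(mu, sigma)                            # ≈ 20.0 and ≈ 4.47
# The observed 1,200 occurrences lie far above the ~20 expected under the
# null hypothesis, so the merged phrase behaves like a collocation.
```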