Working with Text Data (Part 1): Bag of Words

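We have discussed two types of features that represent properties of data: continuous features, which describe a quantity, and categorical features, which are items from a fixed list.
A third type of feature: text

Text data is usually represented as strings, made up of characters.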

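1. Types of data represented as strings

Text is usually just a string in your dataset, but not every string feature should be treated as text.

A string feature can sometimes represent a categorical variable. There is no way to know how to treat a string feature before looking at the data.

⭐ Four kinds of string data: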

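1. Categorical data
- Categorical data is data that comes from a fixed list.
2. Free strings that can be semantically mapped to categories
- Instead of a drop-down menu, you give users a text field in which they enter their favorite color.
- Many people might answer with a color name such as "black" or "blue". Others might make typos, use different spellings (such as "gray" and "grey"), or use more evocative and specific names (such as "midnight blue").
- It is probably best to encode this data as a categorical variable. You can select the categories from the most common entries, or define custom categories so that the responses are meaningful for the application.
3. Structured string data
- Manually entered values that do not correspond to fixed categories but still have some underlying structure, such as addresses, names of people or places, dates, telephone numbers, or other identifiers.
4. Text data
- Examples include tweets, chat logs, and hotel reviews, as well as the collected works of Shakespeare, the content of Wikipedia, or the 50,000 ebooks collected by Project Gutenberg. All of these collections contain information mostly as sentences composed of words.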
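
The second kind of string data (free strings that map to categories) can be illustrated with a small sketch. The survey answers, the `CANONICAL` variant map, and the helper names below are all made up for this illustration; the cutoff of three categories is likewise an arbitrary assumption.

```python
from collections import Counter

# Hypothetical free-text answers to "What is your favorite color?"
responses = ["Black", "blue ", "grey", "gray", "midnight blue", "BLUE", "black", "blak"]

# Hand-written map from spelling variants to one canonical name (an assumption)
CANONICAL = {"grey": "gray", "midnight blue": "blue"}


def normalize(answer):
    """Lowercase, strip whitespace, and merge known spelling variants."""
    cleaned = answer.strip().lower()
    return CANONICAL.get(cleaned, cleaned)


def encode(answer, categories):
    """Return the normalized answer if it is a kept category, else 'other'."""
    name = normalize(answer)
    return name if name in categories else "other"


# Count the normalized answers and keep the most common ones as categories
counts = Counter(normalize(r) for r in responses)
top_categories = {name for name, _ in counts.most_common(3)}

encoded = [encode(r, top_categories) for r in responses]
print(counts.most_common(3))
print(encoded)
```

Rare entries such as the typo "blak" fall into "other" here. Whether to correct them by hand or to drop them depends on the application.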

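2. Example application: sentiment analysis of movie reviews

As a running example in this chapter, we will use a dataset of movie reviews from the IMDb (Internet Movie Database) website, collected by Stanford researcher Andrew Maas.

Dataset link: http://ai.stanford.edu/~amaas/data/sentiment/

This dataset contains the text of each review together with a label that indicates whether the review is "positive" or "negative".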

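The IMDb website itself contains ratings from 1 to 10. To simplify modeling, these ratings are collapsed into a two-class classification dataset: reviews with a rating of 7 or higher are labeled "positive", and reviews with a rating of 4 or lower are labeled "negative". Neutral reviews are not included in the dataset.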
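
The thresholding rule above can be written as a short function. This is only a sketch of the rule as stated in the text; `rating_to_label` and the sample ratings are hypothetical names and values, not part of the dataset's tooling.

```python
def rating_to_label(score):
    """Map a 1-10 rating to a binary label, or None for a neutral rating."""
    if not 1 <= score <= 10:
        raise ValueError("rating must be between 1 and 10")
    if score >= 7:
        return "positive"
    if score <= 4:
        return "negative"
    return None  # ratings of 5 and 6 are treated as neutral and left out


# Made-up ratings, only to show which ones survive the filter
ratings = [10, 7, 6, 5, 4, 1]
labeled = [(s, rating_to_label(s)) for s in ratings if rating_to_label(s) is not None]
print(labeled)
```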

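After unpacking the archive, the dataset consists of text files in two separate folders, one for the training data and one for the test data. Each of these folders has two subfolders, one called pos and one called neg.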
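
To make the layout concrete, here is a minimal standard-library sketch that builds a tiny folder of the same shape in a temporary directory and reads it back, taking one label per subfolder. The helper `read_labeled_folder`, the file names, and the four one-line reviews are invented for illustration and are not part of the real dataset.

```python
import tempfile
from pathlib import Path


def read_labeled_folder(root):
    """Return (texts, labels, label_names), with one label per subfolder of root."""
    root = Path(root)
    # Subfolder names, sorted alphabetically, define the label indices
    label_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    texts, labels = [], []
    for index, name in enumerate(label_names):
        for path in sorted((root / name).glob("*.txt")):
            texts.append(path.read_bytes())  # keep the raw bytes, no decoding
            labels.append(index)
    return texts, labels, label_names


with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / "train"
    # Invented reviews, only to create the pos/neg layout
    samples = {
        "pos": ["Loved it.", "A touching story."],
        "neg": ["Dull.", "A waste of time."],
    }
    for label, reviews in samples.items():
        (train / label).mkdir(parents=True)
        for i, review in enumerate(reviews):
            (train / label / f"{i}.txt").write_text(review, encoding="utf-8")

    texts, labels, label_names = read_labeled_folder(train)

print(label_names)
print(labels)
print(texts[0])
```

In this sketch the label index comes from the alphabetical order of the subfolder names, so neg maps to 0 and pos maps to 1.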

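The pos folder contains all the positive reviews, each as a separate text file, and the neg folder is organized the same way. scikit-learn has a helper function, load_files, that loads files stored in this kind of folder structure, where each subfolder corresponds to a label. We first apply load_files to the training data: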

  from sklearn.datasets import load_files
  from sklearn.model_selection import train_test_split


  # load_files returns a Bunch object containing the training texts and training labels
  reviews_train = load_files("../../datasets/aclImdb/train/")

  # Unpack the review texts and their labels
  text_train, y_train = reviews_train.data, reviews_train.target

  # Inspect the data
  print("type of text_train: {}".format(type(text_train)))
  print("length of text_train: {}".format(len(text_train)))
  print("text_train[6]:\n{}".format(text_train[6]))

  '''
  ```
  type of text_train: <class 'list'>
  length of text_train: 25000
  text_train[6]:
  b"This movie has a special way of telling the story, at first i found it rather odd as it jumped through time and I had no idea whats happening.<br /><br />Anyway the story line was although simple, but still very real and touching. You met someone the first time, you fell in love completely, but broke up at last and promoted a deadly agony. Who hasn't go through this? but we will never forget this kind of pain in our life. <br /><br />I would say i am rather touched as two actor has shown great performance in showing the love between the characters. I just wish that the story could be a happy ending."
  ```
  '''
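
The printed review is a bytes object (note the b prefix) and still contains HTML line breaks (<br />). A possible cleanup step, sketched below under the assumption that the files are UTF-8 encoded, decodes the bytes and replaces the tags with spaces. The function name `clean_review` is hypothetical, and `raw_review` is a shortened excerpt of the output above.

```python
# A shortened excerpt of the review printed above, kept as bytes
raw_review = (
    b"I had no idea whats happening.<br /><br />Anyway the story line was "
    b"although simple, but still very real and touching."
)


def clean_review(raw, encoding="utf-8"):
    """Decode a raw review and replace HTML line breaks with spaces."""
    # Assumption: the text is UTF-8; undecodable bytes are replaced, not raised
    text = raw.decode(encoding, errors="replace")
    text = text.replace("<br />", " ")
    # Collapse the runs of whitespace left behind by consecutive tags
    return " ".join(text.split())


print(clean_review(raw_review))
```

Removing the tags is unlikely to change much for a model, but it keeps formatting residue out of the vocabulary when the text is tokenized later.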