Machine Learning Algorithms (1): Classification Prediction with Logistic Regression

Project link (fork it and run it directly): https://www.heywhale.com/home/column/64141d6b1c8c8b518ba97dcc

1 Introduction to Logistic Regression and Its Applications

1.1 Introduction to Logistic Regression

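Although logistic regression (LR) has "regression" in its name, it is actually a classification model, and it is widely used across many fields. Deep learning currently attracts more attention than traditional methods like this one, but these traditional methods are still widely used because of their particular strengths.

For logistic regression, the two most prominent strengths are that the model is simple and highly interpretable.

Strengths and weaknesses of logistic regression:

Pros: simple to implement and easy to understand; low computational cost, fast, and low storage requirements.
Cons: prone to underfitting, and classification accuracy may not be high.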

1.2 Applications of Logistic Regression

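Logistic regression is widely used in many fields, including machine learning, most medical fields, and the social sciences. For example, the Trauma and Injury Severity Score (TRISS), originally developed by Boyd et al. and widely used to predict mortality in injured patients, is based on logistic regression. Logistic regression is also used to predict the risk of developing a particular disease (such as diabetes or coronary heart disease) from observed patient characteristics (age, sex, body mass index, results of various blood tests, and so on). It is used to predict the probability that a given process, system, or product will fail. It is used in marketing applications as well, for example to predict a customer's propensity to buy a product or cancel a subscription. In economics it can be used to predict the likelihood that a person chooses to enter the labor market, and in business it can be used to predict the likelihood that a homeowner defaults on a mortgage. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing.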

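Logistic regression is also a building block of many classification systems today, for example credit card fraud detection and CTR (click-through rate) estimation built on GBDT + LR. Its advantages are that the output naturally falls between 0 and 1 and has a probabilistic meaning, and that the model is clear and grounded in probability theory. The fitted parameters indicate the influence of each feature on the result, which also makes it a good tool for understanding data. However, because it is essentially a linear classifier, it cannot handle more complex data well. Logistic regression is also often used as a baseline when trying out a new task.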

Having covered the concepts and applications of logistic regression, let's get started!

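2 Learning Objectives

Understand the theory of logistic regression
Learn to call the sklearn logistic regression functions and apply them to prediction on the iris dataset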

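3 Code Workflow

Part1 Demo practice
- Step1: Import libraries
- Step2: Train the model
- Step3: Inspect the model parameters
- Step4: Visualize the data and the model
- Step5: Model prediction
Part2 Logistic regression classification on the iris dataset
- Step1: Import libraries
- Step2: Read/load the data
- Step3: Take a quick look at the data
- Step4: Visual description
- Step5: Train and predict with logistic regression on two classes
- Step6: Train and predict with logistic regression on three classes (multiclass)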

4 Algorithm in Practice

4.1 Demo Practice

Step1: Import libraries

## Basic libraries
import numpy as np

## Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

## Logistic regression model
from sklearn.linear_model import LogisticRegression

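Step2: Train the model

## Demo of LogisticRegression classification

## Construct the dataset
x_fearures = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]])
y_label = np.array([0, 0, 0, 1, 1, 1])

## Instantiate the logistic regression model
lr_clf = LogisticRegression()

## Fit the logistic regression model to the constructed dataset
lr_clf = lr_clf.fit(x_fearures, y_label) # the fitted linear part is z = w0 + w1*x1 + w2*x2, which is then passed through the sigmoid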

Step3: Inspect the model parameters

## Inspect the weights w of the fitted model
print('the weight of Logistic Regression:',lr_clf.coef_)

## Inspect the intercept w0 of the fitted model
print('the intercept(w0) of Logistic Regression:',lr_clf.intercept_)

the weight of Logistic Regression: [[0.73455784 0.69539712]]
the intercept(w0) of Logistic Regression: [-0.13139986]

Step4: Visualize the data and the model

## Visualize the constructed sample points
plt.figure()
plt.scatter(x_fearures[:,0],x_fearures[:,1], c=y_label, s=50, cmap='viridis')
plt.title('Dataset')
plt.show()

# Visualize the decision boundary
plt.figure()
plt.scatter(x_fearures[:,0],x_fearures[:,1], c=y_label, s=50, cmap='viridis')
plt.title('Dataset')

nx, ny = 200, 100
x_min, x_max = plt.xlim()
y_min, y_max = plt.ylim()
x_grid, y_grid = np.meshgrid(np.linspace(x_min, x_max, nx),np.linspace(y_min, y_max, ny))

z_proba = lr_clf.predict_proba(np.c_[x_grid.ravel(), y_grid.ravel()])
z_proba = z_proba[:, 1].reshape(x_grid.shape)
plt.contour(x_grid, y_grid, z_proba, [0.5], linewidths=2., colors='blue')

plt.show()

### Visualize the predictions for new samples

plt.figure()
## new point 1 (the annotation text is passed positionally so it works whether the keyword is named `s` or `text`)
x_fearures_new1 = np.array([[0, -1]])
plt.scatter(x_fearures_new1[:,0],x_fearures_new1[:,1], s=50)
plt.annotate('New point 1',xy=(0,-1),xytext=(-2,0),color='blue',arrowprops=dict(arrowstyle='-|>',connectionstyle='arc3',color='red'))

## new point 2
x_fearures_new2 = np.array([[1, 2]])
plt.scatter(x_fearures_new2[:,0],x_fearures_new2[:,1], s=50)
plt.annotate('New point 2',xy=(1,2),xytext=(-1.5,2.5),color='red',arrowprops=dict(arrowstyle='-|>',connectionstyle='arc3',color='red'))

## Training samples
plt.scatter(x_fearures[:,0],x_fearures[:,1], c=y_label, s=50, cmap='viridis')
plt.title('Dataset')

# Visualize the decision boundary
plt.contour(x_grid, y_grid, z_proba, [0.5], linewidths=2., colors='blue')

plt.show()

Step5: Model prediction

## Use the trained model to predict the classes of the two new samples
y_label_new1_predict = lr_clf.predict(x_fearures_new1)
y_label_new2_predict = lr_clf.predict(x_fearures_new2)

print('The New point 1 predict class:\n',y_label_new1_predict)
print('The New point 2 predict class:\n',y_label_new2_predict)

## Logistic regression is a probabilistic model (it models p = p(y=1|x,\theta)), so predict_proba can be used to get the class probabilities
y_label_new1_predict_proba = lr_clf.predict_proba(x_fearures_new1)
y_label_new2_predict_proba = lr_clf.predict_proba(x_fearures_new2)

print('The New point 1 predict Probability of each class:\n',y_label_new1_predict_proba)
print('The New point 2 predict Probability of each class:\n',y_label_new2_predict_proba)

The New point 1 predict class:
 [0]
The New point 2 predict class:
 [1]
The New point 1 predict Probability of each class:
 [[0.69567724 0.30432276]]
The New point 2 predict Probability of each class:
 [[0.11983936 0.88016064]]

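We can see that the trained logistic regression model predicts new point 1 (x_fearures_new1) as class 0 (lower-left side of the decision boundary) and new point 2 (x_fearures_new2) as class 1 (upper-right side). The blue line in the figure above is the decision boundary of the trained model, where the predicted probability equals 0.5.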
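
To connect the printed parameters with the probabilities above, the short sketch below recomputes the class-1 probability by hand. It assumes the weights and intercept are exactly the values printed in Step3 (copied here as constants rather than refitted), and `sigmoid` is a helper name introduced only for this illustration.

```python
import numpy as np

# Parameters copied from the Step3 output (assumed constants, not refitted here)
w = np.array([0.73455784, 0.69539712])
w0 = -0.13139986

def sigmoid(z):
    """Logistic function: maps a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The two new points used in the demo
x_new = np.array([[0, -1], [1, 2]])

# Linear score z = w0 + w1*x1 + w2*x2, then squash it with the sigmoid
z = w0 + x_new @ w
p_class1 = sigmoid(z)

print(p_class1)                       # probability of class 1 for each point
print((p_class1 >= 0.5).astype(int))  # predicted class labels
```

The values should agree with the second column of the predict_proba output above to several decimal places.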

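4.2 Logistic Regression Classification on the Iris Dataset

At the start of the practice, we first import some basic libraries: numpy (the fundamental package for scientific computing in Python), pandas (a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool), and matplotlib and seaborn for plotting.

Step1: Import libraries

## Basic libraries
import numpy as np
import pandas as pd

## Plotting libraries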
import matplotlib.pyplot as plt
import seaborn as sns

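Here we use the iris dataset to try out the method. The dataset contains 5 variables: 4 feature variables and 1 target class variable. There are 150 samples in total. The target variable is the class of the flower; every sample belongs to one of three species of the genus Iris: Iris-setosa, Iris-versicolor, and Iris-virginica. The four features are sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm); these morphological features were used in the past to identify the species.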

Variable	Description
sepal length	Sepal length (cm)
sepal width	Sepal width (cm)
petal length	Petal length (cm)
petal width	Petal width (cm)
target	Iris species class: 'setosa'(0), 'versicolor'(1), 'virginica'(2)
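
As a quick check of the description above, the following sketch loads the copy of the dataset bundled with sklearn and prints its shape, feature names, class names, and per-class sample counts. It only uses `load_iris` and numpy; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()

print(data.data.shape)           # (number of samples, number of features)
print(list(data.feature_names))  # names of the four feature columns
print(list(data.target_names))   # class names, indexed by the target value

# Number of samples in each class (index = target value)
class_counts = np.bincount(data.target)
print(class_counts)
```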

Step2: Read/load the data

## Use the iris data bundled with sklearn and convert it to a pandas DataFrame
from sklearn.datasets import load_iris
data = load_iris() # load the dataset (features, labels and metadata)
iris_target = data.target # labels of the samples
iris_features = pd.DataFrame(data=data.data, columns=data.feature_names) # convert the features to a DataFrame

Step3: Take a quick look at the data

## Use .info() to view overall information about the data
iris_features.info()

#   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB

## Summary statistics for the features
iris_features.describe()

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.057333	3.758000	1.199333
std	0.828066	0.435866	1.765298	0.762238
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000

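Step4: Visual description

## Combine the labels and the features
iris_all = iris_features.copy() ## work on a copy (DataFrame.copy() is deep by default) so the original data is not modified
iris_all['target'] = iris_target

## Pairwise scatter plots of the features, colored by label
sns.pairplot(data=iris_all,diag_kind='hist', hue= 'target')
plt.show()

The figure above shows, for each pair of features in 2D, how the samples of the different classes are distributed and roughly how well that pair separates the classes.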

for col in iris_features.columns:
    sns.boxplot(x='target', y=col, saturation=0.5,palette='pastel', data=iris_all)
    plt.title(col)
    plt.show()

The box plots likewise show how the distribution of each feature differs between the classes.
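
The differences visible in the plots can also be summarized numerically. The sketch below is a minimal, self-contained example that rebuilds the combined DataFrame and computes the per-class mean of each feature with `groupby`; `class_means` is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
iris_all = pd.DataFrame(data.data, columns=data.feature_names)
iris_all['target'] = data.target

# Mean of every feature within each class (rows: target value 0, 1, 2)
class_means = iris_all.groupby('target').mean()
print(class_means)
```

Features whose class means are far apart relative to their spread, such as petal length, are the ones that separate the classes most clearly in the plots.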

# Draw a 3D scatter plot of the first three features
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection='3d')

iris_all_class0 = iris_all[iris_all['target']==0].values
iris_all_class1 = iris_all[iris_all['target']==1].values
iris_all_class2 = iris_all[iris_all['target']==2].values
# 'setosa'(0), 'versicolor'(1), 'virginica'(2)
ax.scatter(iris_all_class0[:,0], iris_all_class0[:,1], iris_all_class0[:,2],label='setosa')
ax.scatter(iris_all_class1[:,0], iris_all_class1[:,1], iris_all_class1[:,2],label='versicolor')
ax.scatter(iris_all_class2[:,0], iris_all_class2[:,1], iris_all_class2[:,2],label='virginica')
plt.legend()

plt.show()

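Step5: Train and predict with logistic regression on two classes

## To evaluate the model properly, split the data into a training set and a test set, train on the training set, and validate on the test set.
from sklearn.model_selection import train_test_split

## Select the samples of classes 0 and 1 (excluding class 2)
iris_features_part = iris_features.iloc[:100]
iris_target_part = iris_target[:100]

## The test set is 20% of the data (80%/20% split)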
x_train, x_test, y_train, y_test = train_test_split(iris_features_part, iris_target_part, test_size = 0.2, random_state = 2020)

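## Import the logistic regression model from sklearn
from sklearn.linear_model import LogisticRegression

## Define the logistic regression model
clf = LogisticRegression(random_state=0, solver='lbfgs')

# Train the logistic regression model on the training set
clf.fit(x_train, y_train)

## Inspect the weights w
print('the weight of Logistic Regression:',clf.coef_)

## Inspect the intercept w0
print('the intercept(w0) of Logistic Regression:',clf.intercept_)

## Use the trained model to predict on the training set and the test set, respectively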
train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)

from sklearn import metrics

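## Evaluate the model with accuracy (the fraction of predictions that are correct); the first line is for the training set, the second for the test set
print('The accuracy of the Logistic Regression is:',metrics.accuracy_score(y_train,train_predict))
print('The accuracy of the Logistic Regression is:',metrics.accuracy_score(y_test,test_predict))

## Confusion matrix (counts of each combination of true and predicted class); confusion_matrix expects (y_true, y_pred), so rows are true labels and columns are predicted labels
confusion_matrix_result = metrics.confusion_matrix(y_test,test_predict)
print('The confusion matrix result:\n',confusion_matrix_result)

# Visualize the result as a heatmap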
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

The accuracy of the Logistic Regression is: 1.0
The accuracy of the Logistic Regression is: 1.0
The confusion matrix result:
[[ 9 0]
[ 0 11]]

The accuracy is 1, which means every sample was predicted correctly.
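
The accuracy can also be read directly off the confusion matrix: correct predictions lie on the diagonal. The sketch below assumes the matrix is exactly the one printed above, copied in as a constant.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[9, 0],
               [0, 11]])

# Correct predictions are the diagonal entries; divide by the total count
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```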

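Step6: Train and predict with logistic regression on three classes (multiclass)

## The test set is 20% of the data (80%/20% split)
x_train, x_test, y_train, y_test = train_test_split(iris_features, iris_target, test_size = 0.2, random_state = 2020)
## Define the logistic regression model
clf = LogisticRegression(random_state=0, solver='lbfgs')

# Train the logistic regression model on the training set
clf.fit(x_train, y_train)

## Inspect the weights w
print('the weight of Logistic Regression:\n',clf.coef_)

## Inspect the intercept w0
print('the intercept(w0) of Logistic Regression:\n',clf.intercept_)

## Since this is a 3-class problem, we get three sets of logistic regression parameters (one row per class); combined, they perform the three-class classification.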


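## Use the trained model to predict on the training set and the test set, respectively
train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)

## Logistic regression is a probabilistic model, so predict_proba can be used to get the class probabilities
train_predict_proba = clf.predict_proba(x_train)
test_predict_proba = clf.predict_proba(x_test)

print('The test predict Probability of each class:\n',test_predict_proba)
## The first column is the predicted probability of class 0, the second of class 1, and the third of class 2.

## Evaluate the model with accuracy (the fraction of predictions that are correct); the first line is for the training set, the second for the test set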
print('The accuracy of the Logistic Regression is:',metrics.accuracy_score(y_train,train_predict))
print('The accuracy of the Logistic Regression is:',metrics.accuracy_score(y_test,test_predict))

 [9.35695863e-01 6.43039513e-02 1.85301359e-07]
 [9.80621190e-01 1.93787400e-02 7.00125246e-08]
 [1.68478815e-04 3.30167226e-01 6.69664295e-01]
 [3.54046163e-03 4.02267805e-01 5.94191734e-01]
 [9.70617284e-01 2.93824740e-02 2.42443967e-07]
...
 [9.64848137e-01 3.51516748e-02 1.87917880e-07]
 [9.70436779e-01 2.95624025e-02 8.18591606e-07]]
The accuracy of the Logistic Regression is: 0.9833333333333333
The accuracy of the Logistic Regression is: 0.8666666666666667
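
To illustrate how the probability rows relate to the predicted class, the sketch below takes the first five rows printed above (copied as constants) and checks two things: each row sums to 1, and the predicted class is the column with the largest probability.

```python
import numpy as np

# First five rows of the predict_proba output above (copied, not recomputed)
proba = np.array([
    [9.35695863e-01, 6.43039513e-02, 1.85301359e-07],
    [9.80621190e-01, 1.93787400e-02, 7.00125246e-08],
    [1.68478815e-04, 3.30167226e-01, 6.69664295e-01],
    [3.54046163e-03, 4.02267805e-01, 5.94191734e-01],
    [9.70617284e-01, 2.93824740e-02, 2.42443967e-07],
])

row_sums = proba.sum(axis=1)   # each row is a distribution over the 3 classes
pred = proba.argmax(axis=1)    # predicted class = most probable column

print(row_sums)
print(pred)
```

Rows 3 and 4 show the ambiguity between classes 1 and 2: the winning probability is only about 0.6 to 0.67, far less confident than the rows predicted as class 0.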

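## Confusion matrix; confusion_matrix expects (y_true, y_pred), so rows are true labels and columns are predicted labels
confusion_matrix_result = metrics.confusion_matrix(y_test,test_predict)
print('The confusion matrix result:\n',confusion_matrix_result)

# Visualize the result as a heatmap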
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

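The results show that prediction accuracy drops for the three-class problem: accuracy on the test set is 86.67%. This is because the features of 'versicolor' (1) and 'virginica' (2) overlap; as the visualizations also showed, the boundary between these two classes is somewhat blurred (samples of both classes are mixed near the boundary, with no clear separation), so some errors occur when predicting these two classes.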
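
To see which classes are confused with each other, per-class metrics are more informative than overall accuracy. The sketch below is a self-contained rerun of the three-class experiment under two assumptions that differ from the article: the features are passed as a NumPy array rather than a DataFrame, and `max_iter=1000` is set only to avoid convergence warnings. The exact numbers may therefore differ from those reported above.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=2020)

# max_iter is an assumption made here, not a setting used in the article
clf = LogisticRegression(random_state=0, solver='lbfgs', max_iter=1000)
clf.fit(x_train, y_train)
test_predict = clf.predict(x_test)

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_test, test_predict, labels=[0, 1, 2])
print(cm)

# Precision, recall and F1 score for each class
print(metrics.classification_report(
    y_test, test_predict, labels=[0, 1, 2], target_names=data.target_names))
```

Off-diagonal entries of the confusion matrix show where errors occur, which makes it possible to check whether they fall between 'versicolor' and 'virginica' as the text describes.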

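5 Key Concepts

A brief introduction to the principle of logistic regression:

Although logistic regression has "regression" in its name, it is actually a classification method, mainly used for two-class problems (the output takes only two values, one for each class). It uses the logistic function (also called the sigmoid function), which has the form: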
$$
logi(z)=\frac{1}{1+e^{-z}}
$$

The graph of this function can be plotted as follows:

import numpy as np
import matplotlib.pyplot as plt
x = np.arange(-5,5,0.01)
y = 1/(1+np.exp(-x))

plt.plot(x,y)
plt.xlabel('z')
plt.ylabel('y')
plt.grid()
plt.show()

From the plot above we can see that the logistic function is monotonically increasing, takes the value 0.5 at $z=0$, and that $logi(\cdot)$ takes values in the range $(0,1)$.
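
These properties can also be checked numerically. The sketch below evaluates the function on a grid (the grid and the helper name `sigmoid` are choices made for this illustration) and verifies the value at zero, monotonicity, the range, and the symmetry $logi(-z) = 1 - logi(z)$.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 1001)
y = sigmoid(z)

print(sigmoid(0.0))                     # value at z = 0
print(np.all(np.diff(y) > 0))           # strictly increasing on this grid
print(y.min() > 0 and y.max() < 1)      # values stay inside (0, 1)
print(np.allclose(sigmoid(-z), 1 - y))  # symmetry around the point (0, 0.5)
```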

The basic regression equation is $z=w_0+\sum_{i=1}^N w_ix_i$.

Substituting the regression equation into the logistic function gives:
$$
p = p(y=1|x,\theta) = h_\theta(x,\theta)=\frac{1}{1+e^{-(w_0+\sum_{i=1}^N w_ix_i)}}
$$

So $p(y=1|x,\theta) = h_\theta(x,\theta)$ and $p(y=0|x,\theta) = 1-h_\theta(x,\theta)$.
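
A short derivation may help show why the model is still linear in the features. Writing $p = h_\theta(x,\theta)$ and $z=w_0+\sum_{i=1}^N w_ix_i$, and solving for the linear part:
$$
p=\frac{1}{1+e^{-z}} \;\Rightarrow\; 1-p=\frac{e^{-z}}{1+e^{-z}} \;\Rightarrow\; \frac{p}{1-p}=e^{z} \;\Rightarrow\; \ln\frac{p}{1-p}=w_0+\sum_{i=1}^N w_ix_i
$$

In other words, the log-odds of class 1 is a linear function of the features, and each weight $w_i$ is the change in the log-odds when $x_i$ increases by one unit with the other features held fixed.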

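In principle, logistic regression implements a decision boundary: for the function $y=\frac{1}{1+e^{-z}}$, when $z \geq 0$ we have $y \geq 0.5$ and the sample is classified as 1; when $z<0$ we have $y<0.5$ and the sample is classified as 0. The value of $y$ can be interpreted as the predicted probability of class 1.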
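
The equivalence between thresholding the linear score at 0 and thresholding the probability at 0.5 can be illustrated with a small sketch. The scores below are arbitrary random numbers generated only for this check, not outputs of any model in this article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary linear scores z (hypothetical values, fixed seed for repeatability)
rng = np.random.default_rng(0)
z = rng.normal(scale=3.0, size=1000)

by_score = (z >= 0).astype(int)             # threshold the linear score at 0
by_proba = (sigmoid(z) >= 0.5).astype(int)  # threshold the probability at 0.5

print(np.array_equal(by_score, by_proba))
```

Because the two rules agree, the decision boundary is the set of points where $z = 0$, which for two features is a straight line.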

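Training the model essentially means using the data to solve for the specific parameters $w$, which yields a logistic regression model fitted to the features of the current data.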
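
As one possible illustration of "solving for $w$", the sketch below fits the six-point demo dataset from Section 4.1 with plain batch gradient descent on the average log loss. This is a deliberately simplified stand-in: it uses no regularization and is not the solver sklearn uses by default, so the resulting parameters will differ from those printed in Step3. The learning rate and iteration count are arbitrary choices.

```python
import numpy as np

# Demo dataset from Section 4.1
X = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w0, w):
    """Average negative log-likelihood of the labels under the model."""
    p = sigmoid(w0 + X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

w0, w = 0.0, np.zeros(2)
learning_rate = 0.1   # hypothetical choice
initial_loss = log_loss(w0, w)

for _ in range(200):
    p = sigmoid(w0 + X @ w)      # current predicted probabilities
    error = p - y                # gradient of the loss w.r.t. the score z
    w0 -= learning_rate * error.mean()
    w -= learning_rate * (X.T @ error) / len(y)

final_loss = log_loss(w0, w)
pred = (sigmoid(w0 + X @ w) >= 0.5).astype(int)

print('w0, w:', w0, w)
print('loss:', initial_loss, '->', final_loss)
print('predictions:', pred)
```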

For multiclass problems, combining several binary logistic regression models yields a multiclass classifier.
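
One common way to combine binary models is the one-vs-rest scheme, sketched below on the iris data: one binary model per class is trained to separate that class from the others, and the class whose model gives the highest probability wins. This is an illustration of the idea only. Depending on the sklearn version and settings, `LogisticRegression` itself may instead fit a single multinomial (softmax) model for multiclass data. `max_iter=1000` and the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

data = load_iris()
X, y = data.data, data.target
classes = np.unique(y)

# Train one binary "class k vs. the rest" model per class
binary_models = []
for k in classes:
    model = LogisticRegression(max_iter=1000)
    model.fit(X, (y == k).astype(int))
    binary_models.append(model)

# Column k: probability of "is class k" according to the k-th binary model
scores = np.column_stack([m.predict_proba(X)[:, 1] for m in binary_models])

# Predict the class whose binary model is most confident
pred = classes[scores.argmax(axis=1)]
train_accuracy = (pred == y).mean()
print('training accuracy:', train_accuracy)
```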