# AI Data Preparation

## Preprocessing the Data

``````
import numpy as np
from sklearn import preprocessing
``````

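• NumPy - a general-purpose array-processing package designed to handle large multidimensional arrays of arbitrary records efficiently without sacrificing too much speed on small multidimensional arrays.
• sklearn.preprocessing - a package that provides common utility functions and transformer classes for changing raw feature vectors into a representation better suited to machine learning algorithms.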

The examples below use the following sample data:

``````
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])
``````

## Techniques for Data Preprocessing

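### Binarization

Binarization converts numeric values into Boolean values. With a threshold of 0.5, every value above 0.5 becomes 1 and every value at or below 0.5 becomes 0.

``````
data_binarized = preprocessing.Binarizer(threshold = 0.5).transform(input_data)
print("\nBinarized data:\n", data_binarized)
``````

``````
Binarized data:
[[ 1. 0. 1.]
 [ 0. 1. 1.]
 [ 0. 0. 1.]
 [ 1. 1. 0.]]
``````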
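As a rough cross-check, thresholding of this kind can be reproduced with a plain NumPy comparison. The sketch below assumes the same sample data and threshold as above; the name `manual_binarized` is introduced here only for illustration.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# Binarizer maps values strictly greater than the threshold to 1, all others to 0
data_binarized = preprocessing.Binarizer(threshold=0.5).fit_transform(input_data)

# Hypothetical manual equivalent: elementwise comparison cast to float
manual_binarized = (input_data > 0.5).astype(float)

print(np.array_equal(data_binarized, manual_binarized))
```

Note that the value 0.5 in the third row maps to 0, because the comparison is strict.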

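### Mean Removal

Mean removal centers each feature on zero, usually together with scaling to unit standard deviation. First, check the mean and standard deviation of each column of the original data:

``````
print("Mean = ", input_data.mean(axis = 0))
print("Std deviation = ", input_data.std(axis = 0))
``````

``````
Mean = [ 1.75       -1.275       2.2]
Std deviation = [ 2.71431391  4.20022321  4.69414529]
``````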

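`preprocessing.scale` standardizes each column, so afterwards the means are approximately 0 (up to floating-point error) and the standard deviations are 1:

``````
data_scaled = preprocessing.scale(input_data)
print("Mean =", data_scaled.mean(axis=0))
print("Std deviation =", data_scaled.std(axis = 0))
``````

``````
Mean = [ 1.11022302e-16 0.00000000e+00 0.00000000e+00]
Std deviation = [ 1.             1.             1.]
``````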
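The standardization step can also be written out by hand, which makes the underlying formula explicit: subtract the column mean, then divide by the column standard deviation. This is a minimal sketch on the same sample data; `manual_scaled` is an illustrative name, not part of the original code.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# Per-column standardization: (x - mean) / std, using the population std (ddof=0)
manual_scaled = (input_data - input_data.mean(axis=0)) / input_data.std(axis=0)

# Library result for comparison
data_scaled = preprocessing.scale(input_data)

print(np.allclose(manual_scaled, data_scaled))
```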

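### Scaling

Min-max scaling rescales each feature to a fixed range, here 0 to 1, so that no feature is artificially large or small compared with the others:

``````
data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0,1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
print("\nMin max scaled data:\n", data_scaled_minmax)
``````

``````
Min max scaled data:
[[ 0.48648649  0.58252427  0.99122807]
 [ 0.          1.          0.81578947]
 [ 0.27027027  0.          1.        ]
 [ 1.          0.99029126  0.        ]]
``````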
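For a target range of 0 to 1, min-max scaling amounts to `(x - min) / (max - min)` per column. The sketch below checks that against `MinMaxScaler` on the same sample data; `col_min`, `col_max`, and `manual_minmax` are illustrative names.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# Column-wise minimum and maximum
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)

# Manual min-max scaling to the range [0, 1]
manual_minmax = (input_data - col_min) / (col_max - col_min)

# Library result for comparison
data_scaled_minmax = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit_transform(input_data)

print(np.allclose(manual_minmax, data_scaled_minmax))
```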

### Normalization

L1 Normalization

L1 normalization rescales each row so that the sum of the absolute values in that row is 1.

``````
# Normalize data
data_normalized_l1 = preprocessing.normalize(input_data, norm = 'l1')
print("\nL1 normalized data:\n", data_normalized_l1)
``````

``````
L1 normalized data:
[[ 0.22105263  -0.2          0.57894737]
 [-0.2027027    0.32432432   0.47297297]
 [ 0.03571429  -0.56428571   0.4       ]
 [ 0.42142857   0.16428571  -0.41428571]]
``````

L2 Normalization

L2 normalization rescales each row so that the sum of the squared values in that row is 1.

``````
# Normalize data
data_normalized_l2 = preprocessing.normalize(input_data, norm = 'l2')
print("\nL2 normalized data:\n", data_normalized_l2)
``````

``````
L2 normalized data:
[[ 0.33946114  -0.30713151   0.88906489]
 [-0.33325106   0.53320169   0.7775858 ]
 [ 0.05156558  -0.81473612   0.57753446]
 [ 0.68706914   0.26784051  -0.6754239 ]]
``````
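Both normalizations divide each row by a norm of that row: the sum of absolute values for L1, the Euclidean length for L2. The following sketch computes them by hand on the same sample data and compares with `preprocessing.normalize`; the `manual_*` and `*_norms` names are assumptions made for illustration.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# Row-wise norms; keepdims=True keeps shape (4, 1) so the division broadcasts per row
l1_norms = np.abs(input_data).sum(axis=1, keepdims=True)
l2_norms = np.sqrt((input_data ** 2).sum(axis=1, keepdims=True))

manual_l1 = input_data / l1_norms
manual_l2 = input_data / l2_norms

print(np.allclose(manual_l1, preprocessing.normalize(input_data, norm='l1')))
print(np.allclose(manual_l2, preprocessing.normalize(input_data, norm='l2')))
```

Unlike mean removal and min-max scaling, which work per column (feature), normalization here works per row (sample).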

## Labeling the Data

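Labels are often words, while machine learning algorithms generally expect numbers. Label encoding maps each distinct label to an integer.

``````
import numpy as np
from sklearn import preprocessing
``````

``````
# Sample input labels
input_labels = ['red','black','red','green','black','yellow','white']
``````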

``````
# Creating the label encoder
encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)
``````

``````
LabelEncoder()
``````

``````
# encoding a set of labels
test_labels = ['green','red','black']
encoded_values = encoder.transform(test_labels)
print("\nLabels =", test_labels)
``````

``````
Labels = ['green', 'red', 'black']
``````

``````
print("Encoded values =", list(encoded_values))
``````

``````
Encoded values = [1, 2, 0]
``````
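The encoded values follow from how `LabelEncoder` stores its classes: the unique labels are kept in sorted order in the `classes_` attribute, and each label's position is its code. A small sketch that prints this mapping, assuming the same `input_labels` as above:

```python
from sklearn import preprocessing

input_labels = ['red', 'black', 'red', 'green', 'black', 'yellow', 'white']

encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)

# classes_ holds the unique labels in sorted order; the index is the encoded value
for index, label in enumerate(encoder.classes_):
    print(label, "-->", index)
```

With these labels, `black` maps to 0, `green` to 1, `red` to 2, `white` to 3, and `yellow` to 4, which is why `['green', 'red', 'black']` encodes to `[1, 2, 0]`.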

``````
# decoding a set of values
encoded_values = [3,0,4,1]
decoded_list = encoder.inverse_transform(encoded_values)
print("\nEncoded values =", encoded_values)
``````

``````
Encoded values = [3, 0, 4, 1]
``````

``````
print("\nDecoded labels =", list(decoded_list))
``````

``````
Decoded labels = ['white', 'black', 'yellow', 'green']
``````