Quantitative and Qualitative Data

2023-11-19 21:00:48

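Quantitative data is numeric in nature and measures the amount of something.
Qualitative data is categorical in nature and describes a property of something.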

The first rows of the full dataset are shown below. It contains both qualitative and quantitative columns:

import pandas as pd

pd.options.display.max_columns = None
pd.set_option('expand_frame_repr', False)
salary_ranges = pd.read_csv('./data/Salary_Ranges_by_Job_Classification.csv')
print(salary_ranges.head())
#    SetID JobCode                Eff Date              SalEndDate SalarySetID SalPlan Grade  Step BiweeklyHighRate BiweeklyLowRate  UnionCode  ExtendedStep PayType
# 0  COMMN     109  07/01/2009 12:00:00 AM  06/30/2010 12:00:00 AM       COMMN     SFM     0     1            $0.00           $0.00        330             0       C
# 1  COMMN     110  07/01/2009 12:00:00 AM  06/30/2010 12:00:00 AM       COMMN     SFM     0     1           $15.00          $15.00        323             0       D
# 2  COMMN     111  07/01/2009 12:00:00 AM  06/30/2010 12:00:00 AM       COMMN     SFM     0     1           $25.00          $25.00        323             0       D
# 3  COMMN     112  07/01/2009 12:00:00 AM  06/30/2010 12:00:00 AM       COMMN     SFM     0     1           $50.00          $50.00        323             0       D
# 4  COMMN     114  07/01/2009 12:00:00 AM  06/30/2010 12:00:00 AM       COMMN     SFM     0     1          $100.00         $100.00        323             0       M

info() reports each column's name and dtype, along with the number of non-null rows in that column:

print(salary_ranges.info())

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1356 entries, 0 to 1355
# Data columns (total 13 columns):
#  #   Column            Non-Null Count  Dtype
# ---  ------            --------------  -----
#  0   SetID             1356 non-null   object
#  1   JobCode           1356 non-null   object
#  2   Eff Date          1356 non-null   object
#  3   SalEndDate        1356 non-null   object
#  4   SalarySetID       1356 non-null   object
#  5   SalPlan           1356 non-null   object
#  6   Grade             1356 non-null   object
#  7   Step              1356 non-null   int64
#  8   BiweeklyHighRate  1356 non-null   object
#  9   BiweeklyLowRate   1356 non-null   object
#  10  UnionCode         1356 non-null   int64
#  11  ExtendedStep      1356 non-null   int64
#  12  PayType           1356 non-null   object
# dtypes: int64(3), object(10)
# memory usage: 137.8+ KB
# None

The number of missing values per column can also be counted more directly:

print(salary_ranges.isnull().sum())
# SetID               0
# JobCode             0
# Eff Date            0
# SalEndDate          0
# SalarySetID         0
# SalPlan             0
# Grade               0
# Step                0
# BiweeklyHighRate    0
# BiweeklyLowRate     0
# UnionCode           0
# ExtendedStep        0
# PayType             0
# dtype: int64
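
The salary file happens to have no missing values, so the counts above are all zero. As a rough illustration of what the same call reports when gaps do exist, here is a minimal sketch on a made-up DataFrame. The column names and values are assumptions for illustration only, not rows from the salary file. Taking the mean of the boolean mask gives the share of missing rows instead of the count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps (not taken from the salary file).
toy = pd.DataFrame({
    'grade': ['A', None, 'B', 'A'],
    'rate': [10.0, 20.0, np.nan, np.nan],
})

# isnull() yields a boolean mask; summing it counts the missing rows per column.
missing_counts = toy.isnull().sum()

# The mean of the same mask is the fraction of rows that are missing.
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```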

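describe() shows descriptive statistics for the quantitative data. pandas treats only three columns as quantitative: Step, UnionCode, and ExtendedStep. Leaving Step and ExtendedStep aside for now, UnionCode is clearly not quantitative. Its values are numbers, but they do not measure an amount; they only identify a particular union.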

print(salary_ranges.describe())

#               Step    UnionCode  ExtendedStep
# count  1356.000000  1356.000000   1356.000000
# mean      1.294985   392.676991      0.150442
# std       1.045816   338.100562      1.006734
# min       1.000000     1.000000      0.000000
# 25%       1.000000    21.000000      0.000000
# 50%       1.000000   351.000000      0.000000
# 75%       1.000000   790.000000      0.000000
# max       5.000000   990.000000     11.000000
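
One possible way to keep an identifier column such as the union code out of numeric summaries is to cast it to a string before calling describe(). The sketch below uses a small made-up DataFrame rather than the salary file, and the names `toy`, `step`, and `union_code` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: 'union_code' is stored as integers but is only a label.
toy = pd.DataFrame({
    'step': [1, 1, 2, 5],
    'union_code': [330, 323, 323, 790],
})

# Cast the identifier to str so pandas stops treating it as a quantity.
toy['union_code'] = toy['union_code'].astype(str)

# By default describe() on a mixed DataFrame summarizes only the numeric columns.
numeric_summary = toy.describe()
print(numeric_summary.columns.tolist())

# For a qualitative column, counting occurrences is the meaningful summary.
code_counts = toy['union_code'].value_counts()
print(code_counts)
```

A mean or standard deviation of union codes has no interpretation, whereas the frequency of each code does.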

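The two features of most interest are one quantitative column, BiweeklyHighRate (the biweekly high pay rate), and one qualitative column, Grade (the job category):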

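# .copy() makes the selection an independent DataFrame, which avoids the
# SettingWithCopyWarning that older pandas versions may emit on later assignments
salary_ranges = salary_ranges[['BiweeklyHighRate', 'Grade']].copy()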
print(salary_ranges.head())

#   BiweeklyHighRate Grade
# 0            $0.00     0
# 1           $15.00     0
# 2           $25.00     0
# 3           $50.00     0
# 4          $100.00     0

Check the types of the two columns:

print(salary_ranges.info())

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1356 entries, 0 to 1355
# Data columns (total 2 columns):
#  #   Column            Non-Null Count  Dtype
# ---  ------            --------------  -----
#  0   BiweeklyHighRate  1356 non-null   object
#  1   Grade             1356 non-null   object
# dtypes: object(2)
# memory usage: 21.3+ KB
# None

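Let's clean the data by removing the dollar sign in front of each rate so that the column can be given the correct type. Quantitative data is usually stored as integers or floats (preferably floats), while qualitative data is usually stored as strings or Unicode objects.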

# Strip the leading dollar sign; the values are still strings after this step
salary_ranges['BiweeklyHighRate'] = salary_ranges['BiweeklyHighRate'].map(lambda value: value.replace('$', ''))
print(salary_ranges.head())

#   BiweeklyHighRate Grade
# 0             0.00     0
# 1            15.00     0
# 2            25.00     0
# 3            50.00     0
# 4           100.00     0

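The data types have not changed. Removing the dollar sign only edits the strings; both columns are still object: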

print(salary_ranges.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1356 entries, 0 to 1355
# Data columns (total 2 columns):
#  #   Column            Non-Null Count  Dtype 
# ---  ------            --------------  ----- 
#  0   BiweeklyHighRate  1356 non-null   object
#  1   Grade             1356 non-null   object
# dtypes: object(2)
# memory usage: 21.3+ KB
# None

Convert the BiweeklyHighRate column to float and the Grade column to string:

salary_ranges['BiweeklyHighRate'] = salary_ranges['BiweeklyHighRate'].astype(float)
salary_ranges['Grade'] = salary_ranges['Grade'].astype(str)
print(salary_ranges.info())

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1356 entries, 0 to 1355
# Data columns (total 2 columns):
#  #   Column            Non-Null Count  Dtype
# ---  ------            --------------  -----
#  0   BiweeklyHighRate  1356 non-null   float64
#  1   Grade             1356 non-null   object
# dtypes: float64(1), object(1)
# memory usage: 21.3+ KB
# None
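
As an alternative to stripping the symbol and calling astype(float) in two steps, the cleanup can also be sketched with the string accessor and pd.to_numeric. This is only an illustration on made-up values. The thousands separator and the 'n/a' entry below are assumptions added to show how unparsable values behave, and nothing here implies that the salary file contains such rows.

```python
import pandas as pd

# Hypothetical toy values; the comma and the 'n/a' entry are assumptions
# added for illustration, not rows from the salary file.
raw = pd.Series(['$0.00', '$15.00', '$1,250.50', 'n/a'])

# Remove the currency symbol and separators as literal text (regex=False).
cleaned = raw.str.replace('$', '', regex=False).str.replace(',', '', regex=False)

# errors='coerce' turns anything unparsable into NaN instead of raising.
rates = pd.to_numeric(cleaned, errors='coerce')
print(rates)
```

With astype(float), a value like 'n/a' would raise a ValueError. Coercing trades that hard failure for NaN entries, which then show up in isnull().sum().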