Python sklearn: the GridSearchCV() function, with an introduction, a worked example, and a detailed usage guide

2020-10-18 16:00:39


Contents

GridSearchCV(): introduction, worked example, and detailed usage guide

GridSearchCV(): introduction, worked example, and detailed usage guide

class GridSearchCV Found at: sklearn.model_selection._search

class GridSearchCV(BaseSearchCV):
    """Exhaustive search over specified parameter values for an estimator.
    Important members are fit, predict.

    GridSearchCV implements a "fit" and a "score" method. It also implements "predict", "predict_proba", "decision_function", "transform" and "inverse_transform" if they are implemented in the estimator used. The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
    Read more in the :ref:`User Guide <grid_search>`.


    Parameters
    ----------
    estimator : estimator object. This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a ``score`` function, or ``scoring`` must be passed.
    
    param_grid : dict or list of dictionaries. Dictionary with parameters names (`str`) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
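
As a small illustration of the list-of-dictionaries form of `param_grid`, the sketch below searches `gamma` only together with the `rbf` kernel. The iris data, the `SVC` estimator and the concrete values are assumptions chosen for the example; they are not part of the docstring.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Two separate grids: gamma is only explored together with the rbf kernel.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.01, 0.1]},
]

# 2 linear candidates + 4 rbf candidates.
n_candidates = len(ParameterGrid(param_grid))

search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)

# gamma is masked for the candidates that come from the linear-only grid.
gamma_mask = search.cv_results_["param_gamma"].mask.tolist()
print(n_candidates, search.best_params_)
```

Splitting the grid this way avoids fitting combinations in which a parameter has no effect, such as `gamma` with a linear kernel.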
    
    scoring : str, callable, list/tuple or dict, default=None. A single str (see :ref:`scoring_parameter`) or a callable (see :ref:`scoring`) to evaluate the predictions on the test set.
    For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.
    NOTE that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each.
    See :ref:`multimetric_grid_search` for an example.
    If None, the estimator's score method is used.
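
A minimal sketch of multi-metric scoring follows. The metric names `accuracy` and `f1_macro`, the iris data and the `C` values are assumptions made for illustration. Because more than one metric is evaluated, `refit` has to name the metric used to pick the best candidate.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Scorer names (dict keys) become part of the cv_results_ column names.
scoring = {
    "accuracy": "accuracy",
    "f1_macro": make_scorer(f1_score, average="macro"),
}

search = GridSearchCV(
    SVC(),
    {"C": [0.1, 1, 10]},
    scoring=scoring,
    refit="accuracy",  # metric that decides best_params_ and best_estimator_
    cv=3,
)
search.fit(X, y)

metric_columns = sorted(k for k in search.cv_results_ if k.startswith("mean_test_"))
print(metric_columns, search.best_params_)
```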
    
    n_jobs : int, default=None. Number of jobs to run in parallel. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.
    .. versionchanged:: v0.20. `n_jobs` default changed from 1 to None
    
    pre_dispatch : int, or str, default='2*n_jobs'. Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
    - None, in which case all the jobs are immediately  created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
    - An int, giving the exact number of total jobs that are spawned
    - A str, giving an expression as a function of n_jobs, as in '2*n_jobs'
    
    iid : bool, default=False.  If True, return the average score across folds, weighted by the number  of samples in each test set. In this case, the data is assumed to be identically distributed across the folds, and the loss minimized is  the total loss per sample, and not the mean loss across the folds.
    .. deprecated:: 0.22. Parameter ``iid`` is deprecated in 0.22 and will be removed in 0.24
    
    cv : int, cross-validation generator or an iterable, default=None. Determines the cross-validation splitting strategy.  Possible inputs for cv are:
    - None, to use the default 5-fold cross validation,
    - integer, to specify the number of folds in a `(Stratified)KFold`,
    - :term:`CV splitter`,
    - An iterable yielding (train, test) splits as arrays of indices.
    For integer/None inputs, if the estimator is a classifier and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used.
    Refer to the :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.
    .. versionchanged:: 0.22. ``cv`` default value if None changed from 3-fold to 5-fold.
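
The accepted forms of `cv` can be sketched as follows: an integer, a CV splitter object, and an iterable of (train, test) index arrays. The fold counts, the random seed and the `SVC` estimator are assumptions picked for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = {"C": [1, 10]}

# An integer: for a classifier this selects a 4-fold StratifiedKFold.
by_int = GridSearchCV(SVC(), grid, cv=4).fit(X, y)

# A CV splitter object, here shuffled with a fixed seed.
splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
by_splitter = GridSearchCV(SVC(), grid, cv=splitter).fit(X, y)

# An iterable of (train, test) index arrays, materialised as a list.
custom_splits = list(splitter.split(X, y))
by_iterable = GridSearchCV(SVC(), grid, cv=custom_splits).fit(X, y)

print(by_int.n_splits_, by_splitter.n_splits_, by_iterable.n_splits_)
```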
    
    refit : bool, str, or callable, default=True. Refit an estimator using the best found parameters on the whole dataset. For multiple metric evaluation, this needs to be a `str` denoting the scorer that would be used to find the best parameters for refitting the estimator at the end.
    Where there are considerations other than maximum score in choosing a best estimator, ``refit`` can be set to a function which returns the selected ``best_index_`` given ``cv_results_``. In that case, the ``best_estimator_`` and ``best_params_`` will be set according to the returned ``best_index_`` while the ``best_score_`` attribute will not be available.
    The refitted estimator is made available at the ``best_estimator_`` attribute and permits using ``predict`` directly on this ``GridSearchCV`` instance.
    Also for multiple metric evaluation, the attributes ``best_index_``, ``best_score_`` and ``best_params_`` will only be available if ``refit`` is set and all of them will be determined w.r.t this specific scorer.
    See ``scoring`` parameter to know more about multiple metric evaluation.
    .. versionchanged:: 0.20. Support for callable added.
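
A callable `refit` might look like the sketch below. The selection rule (the smallest `C` whose mean score lies within one standard deviation of the best) and the helper name `smallest_c_within_one_std` are hypothetical; the only contract taken from the docstring is that the function receives `cv_results_` and returns the chosen index.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)


def smallest_c_within_one_std(cv_results):
    """Hypothetical rule: smallest C within one std of the best mean score."""
    means = np.asarray(cv_results["mean_test_score"])
    stds = np.asarray(cv_results["std_test_score"])
    best = int(np.argmax(means))
    threshold = means[best] - stds[best]
    eligible = np.flatnonzero(means >= threshold)
    c_values = np.array([cv_results["params"][i]["C"] for i in eligible])
    # Return a plain Python int, which becomes best_index_.
    return int(eligible[np.argmin(c_values)])


search = GridSearchCV(
    SVC(), {"C": [0.1, 1, 10, 100]}, refit=smallest_c_within_one_std, cv=5
)
search.fit(X, y)
print(search.best_index_, search.best_params_)
```

As the docstring states, `best_estimator_` and `best_params_` follow the returned index, while `best_score_` is not set in this mode.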
    
    verbose : integer. Controls the verbosity: the higher, the more messages.
    
    error_score : 'raise' or numeric, default=np.nan. Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit  step, which will always raise the error.
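
The effect of a numeric `error_score` can be sketched with a grid that contains one deliberately invalid value. Using `C=-1.0` to force a fit failure and `0.0` as the replacement score are assumptions made for the demonstration.

```python
import warnings

from sklearn.datasets import load_iris
from sklearn.exceptions import FitFailedWarning
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# C must be positive, so every fit with C=-1.0 fails; C=1.0 fits normally.
grid = {"C": [-1.0, 1.0]}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    search = GridSearchCV(SVC(), grid, cv=3, error_score=0.0).fit(X, y)

scores = search.cv_results_["mean_test_score"]
failed_fit_warned = any(
    issubclass(w.category, FitFailedWarning) for w in caught
)
print(scores, search.best_params_, failed_fit_warned)
```

With `error_score='raise'` the same search would stop at the first failing fit instead of recording a placeholder score.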
    
    return_train_score : bool, default=False. If ``False``, the ``cv_results_`` attribute will not include training scores. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However computing the scores on the training set can be computationally expensive and is not strictly required to select the parameters that yield the best generalization performance.
    .. versionadded:: 0.19
    .. versionchanged:: 0.21. Default value was changed from ``True`` to ``False``
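
To see what `return_train_score=True` adds, the following sketch compares mean training and test scores per candidate. The `C` values and the iris data are assumptions chosen so that the candidates differ noticeably.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    SVC(), {"C": [0.01, 1, 100]}, cv=3, return_train_score=True
).fit(X, y)

train = search.cv_results_["mean_train_score"]
test = search.cv_results_["mean_test_score"]

# A large gap between training and test score hints at overfitting.
for params, tr, te in zip(search.cv_results_["params"], train, test):
    print(params, round(float(tr), 3), round(float(te), 3))

# With the default (False) the training-score columns are absent.
default_search = GridSearchCV(SVC(), {"C": [1]}, cv=3).fit(X, y)
has_train_by_default = "mean_train_score" in default_search.cv_results_
```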


    Examples
    --------
    >>> from sklearn import svm, datasets
    >>> from sklearn.model_selection import GridSearchCV
    >>> iris = datasets.load_iris()
    >>> parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
    >>> svc = svm.SVC()
    >>> clf = GridSearchCV(svc, parameters)
    >>> clf.fit(iris.data, iris.target)
    GridSearchCV(estimator=SVC(),
    param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')})
    >>> sorted(clf.cv_results_.keys())
    ['mean_fit_time', 'mean_score_time', 'mean_test_score',...
    'param_C', 'param_kernel', 'params',...
    'rank_test_score', 'split0_test_score',...
    'split2_test_score', ...
    'std_fit_time', 'std_score_time', 'std_test_score']
 

  Attributes
    ----------
    cv_results_ : dict of numpy (masked) ndarrays. A dict with keys as column headers and values as columns, that can be imported into a pandas ``DataFrame``.

    For instance the below given table
    +------------+-----------+------------+-----------------+---+---------+
    |param_kernel|param_gamma|param_degree|split0_test_score|...|rank_t...|
    +============+===========+============+=================+===+=========+
    |  'poly'    |     --    |      2     |       0.80      |...|    2    |
    +------------+-----------+------------+-----------------+---+---------+
    |  'poly'    |     --    |      3     |       0.70      |...|    4    |
    +------------+-----------+------------+-----------------+---+---------+
    |  'rbf'     |     0.1   |     --     |       0.80      |...|    3    |
    +------------+-----------+------------+-----------------+---+---------+
    |  'rbf'     |     0.2   |     --     |       0.93      |...|    1    |
    +------------+-----------+------------+-----------------+---+---------+
    will be represented by a ``cv_results_`` dict of::
    {
    'param_kernel': masked_array(data = ['poly', 'poly', 'rbf', 'rbf'],
    mask = [False False False False]...),
    'param_gamma': masked_array(data = [-- -- 0.1 0.2],
    mask = [ True  True False False]...),
    'param_degree': masked_array(data = [2.0 3.0 -- --],
    mask = [False False  True  True]...),
    'split0_test_score'  : [0.80, 0.70, 0.80, 0.93],
    'split1_test_score'  : [0.82, 0.50, 0.70, 0.78],
    'mean_test_score'    : [0.81, 0.60, 0.75, 0.85],
    'std_test_score'     : [0.01, 0.10, 0.05, 0.08],
    'rank_test_score'    : [2, 4, 3, 1],
    'split0_train_score' : [0.80, 0.92, 0.70, 0.93],
    'split1_train_score' : [0.82, 0.55, 0.70, 0.87],
    'mean_train_score'   : [0.81, 0.74, 0.70, 0.90],
    'std_train_score'    : [0.01, 0.19, 0.00, 0.03],
    'mean_fit_time'      : [0.73, 0.63, 0.43, 0.49],
    'std_fit_time'       : [0.01, 0.02, 0.01, 0.01],
    'mean_score_time'    : [0.01, 0.06, 0.04, 0.04],
    'std_score_time'     : [0.00, 0.00, 0.00, 0.01],
    'params'             : [{'kernel': 'poly', 'degree': 2}, ...],
    }
    
    NOTE
    
    The key ``'params'`` is used to store a list of parameter settings dicts for all the parameter candidates.
    The ``mean_fit_time``, ``std_fit_time``, ``mean_score_time`` and  ``std_score_time`` are all in seconds.
    For multi-metric evaluation, the scores for all the scorers are available in the ``cv_results_`` dict at the keys ending with that scorer's name (``'_<scorer_name>'``) instead of ``'_score'`` shown above. ('split0_test_precision', 'mean_train_precision' etc.)
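
Since `cv_results_` is a dict of columns, it can be loaded into a pandas `DataFrame` as the docstring suggests. The sketch below assumes pandas is installed and reuses the kernel/C grid from the Examples section; the selected columns are just one convenient view.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

search = GridSearchCV(SVC(), {"kernel": ["linear", "rbf"], "C": [1, 10]}, cv=3)
search.fit(X, y)

# One row per candidate, one column per cv_results_ key.
results = pd.DataFrame(search.cv_results_)

# Keep the most useful columns and list the best candidates first.
table = results[
    ["param_kernel", "param_C", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(table)
```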
    
    best_estimator_ : estimator. Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if ``refit=False``.
    See ``refit`` parameter for more information on allowed values.
    
    best_score_ : float. Mean cross-validated score of the best_estimator. For multi-metric evaluation, this is present only if ``refit`` is specified. This attribute is not available if ``refit`` is a function.
    
    best_params_ : dict. Parameter setting that gave the best results on the hold out data. For multi-metric evaluation, this is present only if ``refit`` is specified.
    
    best_index_ : int. The index (of the ``cv_results_`` arrays) which corresponds to the best candidate parameter setting. The dict at ``search.cv_results_['params'][search.best_index_]`` gives the parameter setting for the best model, that gives the highest mean score (``search.best_score_``).
    For multi-metric evaluation, this is present only if ``refit`` is specified.
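
How the `best_*` attributes relate to `cv_results_` can be checked with a short sketch. The grid and data are the same illustrative assumptions as in the Examples section.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

search = GridSearchCV(SVC(), {"kernel": ["linear", "rbf"], "C": [1, 10]}, cv=5)
search.fit(X, y)

i = search.best_index_
# best_params_ and best_score_ are the entries of cv_results_ at best_index_.
same_params = search.cv_results_["params"][i] == search.best_params_
same_score = search.cv_results_["mean_test_score"][i] == search.best_score_

# With refit=True (the default) the search object itself can predict,
# delegating to best_estimator_.
predictions = search.predict(X[:5])
print(search.best_params_, search.best_score_, predictions)
```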
    
    scorer_ : function or a dict.  Scorer function used on the held out data to choose the best parameters for the model. For multi-metric evaluation, this attribute holds the validated ``scoring`` dict which maps the scorer key to the scorer callable.
    
    n_splits_ : int. The number of cross-validation splits (folds/iterations).
    
    refit_time_ : float. Seconds used for refitting the best model on the whole dataset. This is present only if ``refit`` is not False.
    .. versionadded:: 0.20
    
    Notes
    -----
    The parameters selected are those that maximize the score of the left out data, unless an explicit score is passed in which case it is used instead.
    If `n_jobs` was set to a value higher than one, the data is copied for each point in the grid (and not `n_jobs` times). This is done for efficiency reasons if individual jobs take very little time, but may raise errors if the dataset is large and not enough memory is available. A workaround in this case is to set `pre_dispatch`. Then, the memory is copied only `pre_dispatch` many times. A reasonable value for `pre_dispatch` is `2 * n_jobs`.
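
The note on `n_jobs` and `pre_dispatch` can be sketched as below. Two workers and the joblib threading backend are assumptions made to keep the example lightweight; the point is only that a parallel search with a bounded `pre_dispatch` yields the same scores as a serial one.

```python
import numpy as np
from joblib import parallel_backend
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = {"kernel": ["linear", "rbf"], "C": [1, 10]}

serial = GridSearchCV(SVC(), grid, cv=3).fit(X, y)

# Two workers; at most 2 * n_jobs = 4 tasks are dispatched ahead of time.
with parallel_backend("threading"):
    parallel = GridSearchCV(
        SVC(), grid, cv=3, n_jobs=2, pre_dispatch="2*n_jobs"
    ).fit(X, y)

same_scores = np.allclose(
    serial.cv_results_["mean_test_score"],
    parallel.cv_results_["mean_test_score"],
)
print(same_scores, parallel.best_params_)
```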
    
    See Also
    ---------
    :class:`ParameterGrid`:
    generates all the combinations of a hyperparameter grid.
    
    :func:`sklearn.model_selection.train_test_split`:
    utility function to split the data into a development set usable for fitting a GridSearchCV instance and an evaluation set for  its final evaluation.
    
    :func:`sklearn.metrics.make_scorer`:
    Make a scorer from a performance metric or loss function.
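
The `train_test_split` entry describes a development set for fitting the search and an evaluation set for the final check. A minimal sketch, with the 70/30 split ratio, the random seed and the grid chosen as assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data; the search never sees it during tuning.
X_dev, X_eval, y_dev, y_eval = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

search = GridSearchCV(SVC(), {"kernel": ["linear", "rbf"], "C": [1, 10]}, cv=5)
search.fit(X_dev, y_dev)

# score() uses the refitted best_estimator_ on the untouched evaluation set.
eval_score = search.score(X_eval, y_eval)
print(search.best_params_, eval_score)
```

Evaluating on data that played no part in the search gives a less optimistic estimate than `best_score_`, which was itself used to choose the parameters.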
    
    """


    _required_parameters = ["estimator", "param_grid"]

    @_deprecate_positional_args
    def __init__(self, estimator, param_grid, *, scoring=None,
                 n_jobs=None, iid='deprecated', refit=True, cv=None,
                 verbose=0, pre_dispatch='2*n_jobs',
                 error_score=np.nan, return_train_score=False):
        super().__init__(
            estimator=estimator, scoring=scoring,
            n_jobs=n_jobs, iid=iid, refit=refit, cv=cv, verbose=verbose,
            pre_dispatch=pre_dispatch, error_score=error_score,
            return_train_score=return_train_score)
        self.param_grid = param_grid
        _check_param_grid(param_grid)
    
    def _run_search(self, evaluate_candidates):
        """Search all candidates in param_grid"""
        evaluate_candidates(ParameterGrid(self.param_grid))
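
As `_run_search` shows, the candidates are simply the expansion of `param_grid` by `ParameterGrid`. The sketch below expands the grid from the Examples section by hand and compares it with the `params` column of a fitted search; the data and the fold count are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10]}

# Every combination of the listed values, as a list of dicts.
candidates = list(ParameterGrid(param_grid))

search = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)

# The search evaluates exactly these candidates, in the same order.
same_candidates = search.cv_results_["params"] == candidates
print(len(candidates), same_candidates)
```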