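I am trying to perform nested cross-validation while incorporating group-based splitting with the GroupShuffleSplit class. However, when I try to use a custom cross-validation object with GridSearchCV, I get "TypeError: cannot pickle 'generator' object". As far as I understand, this error occurs because group_split.split(...) returns a generator, which cannot be used in the cross_val_score function. So I would like to ask whether there is an easy way to use GroupShuffleSplit for nested cross-validation.

About my simplified example code: I have a dataset with features X, labels y, and group labels groups. The goal is to perform nested cross-validation in which both the inner and the outer loop split the data according to the group labels. I want to use GridSearchCV for hyperparameter tuning and cross_val_score to evaluate performance.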

import numpy as np
from sklearn.model_selection import GroupShuffleSplit, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(100, 10)
y = np.random.randint(2, size=100)
groups = np.random.randint(4, size=100)  # Example group labels

rf_classifier = RandomForestClassifier()
param_grid = {'n_estimators': [50, 100, 200]}

inner_cv = GroupShuffleSplit(n_splits=5, test_size=0.2)
outer_cv = GroupShuffleSplit(n_splits=5, test_size=0.2)

grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=inner_cv.split(X, y, groups=groups))
nested_scores = cross_val_score(estimator=grid_search, X=X, y=y, cv=outer_cv.split(X, y, groups=groups))

This results in the following stack trace:

---------------------------------------------------------------------------
Empty                                     Traceback (most recent call last)
File c:\Anaconda3_x64\lib\site-packages\joblib\parallel.py:825, in Parallel.dispatch_one_batch(self, iterator)
    824 try:
--> 825     tasks = self._ready_batches.get(block=False)
    826 except queue.Empty:
    827     # slice the iterator n_jobs * batchsize items at a time. If the
    828     # slice returns less than that, then the current batchsize puts
   (...)
    831     # accordingly to distribute evenly the last items between all
    832     # workers.

File c:\Anaconda3_x64\lib\queue.py:168, in Queue.get(self, block, timeout)
    167     if not self._qsize():
--> 168         raise Empty
    169 elif timeout is None:

Empty: 

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[29], line 16
     13 outer_cv = GroupShuffleSplit(n_splits=5, test_size=0.2)
     15 grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=inner_cv.split(X, y, groups=groups))
---> 16 nested_scores = cross_val_score(estimator=grid_search, X=X, y=y, cv=outer_cv.split(X, y, groups=groups))
     18 print(nested_scores)

File c:\Anaconda3_x64\lib\site-packages\sklearn\model_selection\_validation.py:515, in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
    512 # To ensure multimetric format is not supported
    513 scorer = check_scoring(estimator, scoring=scoring)
--> 515 cv_results = cross_validate(
    516     estimator=estimator,
    517     X=X,
    518     y=y,
    519     groups=groups,
    520     scoring={"score": scorer},
    521     cv=cv,
    522     n_jobs=n_jobs,
    523     verbose=verbose,
    524     fit_params=fit_params,
    525     pre_dispatch=pre_dispatch,
    526     error_score=error_score,
    527 )
    528 return cv_results["test_score"]

File c:\Anaconda3_x64\lib\site-packages\sklearn\model_selection\_validation.py:266, in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
    263 # We clone the estimator to make sure that all the folds are
    264 # independent, and that it is pickle-able.
    265 parallel = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)
--> 266 results = parallel(
    267     delayed(_fit_and_score)(
    268         clone(estimator),
    269         X,
    270         y,
    271         scorers,
    272         train,
    273         test,
    274         verbose,
    275         None,
    276         fit_params,
    277         return_train_score=return_train_score,
    278         return_times=True,
    279         return_estimator=return_estimator,
    280         error_score=error_score,
    281     )
    282     for train, test in cv.split(X, y, groups)
    283 )
    285 _warn_or_raise_about_fit_failures(results, error_score)
    287 # For callabe scoring, the return type is only know after calling. If the
    288 # return type is a dictionary, the error scores can now be inserted with
    289 # the correct key.

File c:\Anaconda3_x64\lib\site-packages\sklearn\utils\parallel.py:63, in Parallel.__call__(self, iterable)
     58 config = get_config()
     59 iterable_with_config = (
     60     (_with_config(delayed_func, config), args, kwargs)
     61     for delayed_func, args, kwargs in iterable
     62 )
---> 63 return super().__call__(iterable_with_config)

File c:\Anaconda3_x64\lib\site-packages\joblib\parallel.py:1048, in Parallel.__call__(self, iterable)
   1039 try:
   1040     # Only set self._iterating to True if at least a batch
   1041     # was dispatched. In particular this covers the edge
   (...)
   1045     # was very quick and its callback already dispatched all the
   1046     # remaining jobs.
   1047     self._iterating = False
-> 1048     if self.dispatch_one_batch(iterator):
   1049         self._iterating = self._original_iterator is not None
   1051     while self.dispatch_one_batch(iterator):

File c:\Anaconda3_x64\lib\site-packages\joblib\parallel.py:836, in Parallel.dispatch_one_batch(self, iterator)
    833 n_jobs = self._cached_effective_n_jobs
    834 big_batch_size = batch_size * n_jobs
--> 836 islice = list(itertools.islice(iterator, big_batch_size))
    837 if len(islice) == 0:
    838     return False

File c:\Anaconda3_x64\lib\site-packages\sklearn\utils\parallel.py:59, in <genexpr>(.0)
     54 # Capture the thread-local scikit-learn configuration at the time
     55 # Parallel.__call__ is issued since the tasks can be dispatched
     56 # in a different thread depending on the backend and on the value of
     57 # pre_dispatch and n_jobs.
     58 config = get_config()
---> 59 iterable_with_config = (
     60     (_with_config(delayed_func, config), args, kwargs)
     61     for delayed_func, args, kwargs in iterable
     62 )
     63 return super().__call__(iterable_with_config)

File c:\Anaconda3_x64\lib\site-packages\sklearn\model_selection\_validation.py:268, in <genexpr>(.0)
    263 # We clone the estimator to make sure that all the folds are
    264 # independent, and that it is pickle-able.
    265 parallel = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)
    266 results = parallel(
    267     delayed(_fit_and_score)(
--> 268         clone(estimator),
    269         X,
    270         y,
    271         scorers,
    272         train,
    273         test,
    274         verbose,
    275         None,
    276         fit_params,
    277         return_train_score=return_train_score,
    278         return_times=True,
    279         return_estimator=return_estimator,
    280         error_score=error_score,
    281     )
    282     for train, test in cv.split(X, y, groups)
    283 )
    285 _warn_or_raise_about_fit_failures(results, error_score)
    287 # For callabe scoring, the return type is only know after calling. If the
    288 # return type is a dictionary, the error scores can now be inserted with
    289 # the correct key.

File c:\Anaconda3_x64\lib\site-packages\sklearn\base.py:89, in clone(estimator, safe)
     87 new_object_params = estimator.get_params(deep=False)
     88 for name, param in new_object_params.items():
---> 89     new_object_params[name] = clone(param, safe=False)
     90 new_object = klass(**new_object_params)
     91 params_set = new_object.get_params(deep=False)

File c:\Anaconda3_x64\lib\site-packages\sklearn\base.py:70, in clone(estimator, safe)
     68 elif not hasattr(estimator, "get_params") or isinstance(estimator, type):
     69     if not safe:
---> 70         return copy.deepcopy(estimator)
     71     else:
     72         if isinstance(estimator, type):

File c:\Anaconda3_x64\lib\copy.py:161, in deepcopy(x, memo, _nil)
    159 reductor = getattr(x, "__reduce_ex__", None)
    160 if reductor is not None:
--> 161     rv = reductor(4)
    162 else:
    163     reductor = getattr(x, "__reduce__", None)

TypeError: cannot pickle 'generator' object
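
The last frames of the traceback show clone calling copy.deepcopy on a parameter of grid_search, which here is the generator passed as cv. The following is a minimal sketch of that failure using only the standard library; fake_split is a hypothetical stand-in for inner_cv.split(X, y, groups=groups) and is not part of scikit-learn.

```python
import copy


def fake_split(n_splits=3):
    # Hypothetical stand-in for inner_cv.split(X, y, groups=groups):
    # lazily yields (train_indices, test_indices) pairs.
    for i in range(n_splits):
        yield [0, 1, 2], [3 + i]


gen = fake_split()
try:
    copy.deepcopy(gen)  # generators cannot be deep-copied
    deepcopy_failed = False
except TypeError as exc:
    deepcopy_failed = True
    print("deepcopy failed:", exc)

# A list of precomputed splits is an ordinary object and can be copied.
splits = list(fake_split())
splits_copy = copy.deepcopy(splits)
```

Materialising the splits into a list avoids the copy error, but indices computed on the full X would still not line up with the subset that the inner search receives from the outer loop.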

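Recommended answer

Prior to version 1.3, I am not sure this is possible without writing a manual loop to replace cross_val_score. Besides the generator problem, you are also telling the grid search object to split all of X, but it never sees all of X (it has already been split by the outer splitter).

In 1.3 we get metadata routing, which automatically routes groups to the group splitters (note that the params argument of cross_val_score used below was added in 1.4). Then we can do this: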
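
For versions without metadata routing, the manual loop mentioned above might look like the sketch below. It assumes the same kind of random data as the question; the fixed seeds and the small n_estimators values are assumptions added only to keep the run fast and repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupShuffleSplit

rng = np.random.RandomState(0)
X = rng.rand(100, 10)
y = rng.randint(2, size=100)
groups = rng.randint(4, size=100)  # Example group labels

param_grid = {"n_estimators": [5, 10]}  # small values keep the sketch fast
inner_cv = GroupShuffleSplit(n_splits=2, test_size=0.33, random_state=0)
outer_cv = GroupShuffleSplit(n_splits=2, test_size=0.25, random_state=0)

nested_scores = []
for train_idx, test_idx in outer_cv.split(X, y, groups=groups):
    grid_search = GridSearchCV(
        RandomForestClassifier(random_state=0), param_grid, cv=inner_cv
    )
    # Pass only the training slice of groups, so the inner splitter gets
    # labels aligned with the rows of X[train_idx].
    grid_search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
    # Score the refitted best estimator on the held-out outer groups.
    nested_scores.append(grid_search.score(X[test_idx], y[test_idx]))

nested_scores = np.array(nested_scores)
print(nested_scores)
```

The splitter objects themselves are passed as cv, not the result of calling split, so nothing here needs to copy a generator.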

from sklearn import set_config
set_config(enable_metadata_routing=True)

grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=inner_cv)
nested_scores = cross_val_score(estimator=grid_search, X=X, y=y, cv=outer_cv, params={'groups': groups})

Just to check that this really routes to both splitters, here is a modified version of the script:

import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, GridSearchCV, cross_val_score
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn import set_config

set_config(enable_metadata_routing=True)

X = np.random.rand(100, 10)
y = np.random.randint(2, size=100)
groups = np.random.randint(4, size=100)  # Example group labels

X = pd.DataFrame(X)


class MyClassifier(ClassifierMixin, BaseEstimator):
    def __init__(self, n_estimators=1):
        self.n_estimators = n_estimators

    def fit(self, X, y):
        print("train: ", groups[X.index])
        return self
    
    def predict(self, X):
        print("test: ", groups[X.index])
        return np.random.randint(2, size=len(X))


rf_classifier = MyClassifier()
param_grid = {'n_estimators': [50, 100]}

inner_cv = GroupShuffleSplit(n_splits=2, test_size=0.33)
outer_cv = GroupShuffleSplit(n_splits=2, test_size=0.25)

grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=inner_cv, verbose=10)
nested_scores = cross_val_score(estimator=grid_search, X=X, y=y, cv=outer_cv, params={'groups': groups}, verbose=10)

print(nested_scores)

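The outer split puts one group into the test set, then the inner splits select one of the remaining three groups for testing and the other two for training. Here is my output: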

[CV] START .....................................................................
Fitting 2 folds for each of 2 candidates, totalling 4 fits
[CV 1/2; 1/2] START n_estimators=50.............................................
train:  [1 3 3 3 1 3 3 1 1 1 1 1 3 1 1 1 3 3 1 3 3 3 3 1 1 1 3 3 3 3 3 3 3 3 3 1 3
 3 3 3 1 3 1 1 1 3 3 1 1 3 1 1 1 1 1 1]
test:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[CV 1/2; 1/2] END ..............n_estimators=50;, score=0.353 total time=   0.0s
[CV 2/2; 1/2] START n_estimators=50.............................................
train:  [3 3 3 3 3 0 0 3 3 3 3 3 3 0 3 0 0 0 3 3 0 0 0 3 3 0 0 3 3 3 3 3 3 3 3 3 0
 0 0 0 3 3 3 0 0 3]
test:  [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
[CV 2/2; 1/2] END ..............n_estimators=50;, score=0.407 total time=   0.0s
[CV 1/2; 2/2] START n_estimators=100............................................
train:  [1 3 3 3 1 3 3 1 1 1 1 1 3 1 1 1 3 3 1 3 3 3 3 1 1 1 3 3 3 3 3 3 3 3 3 1 3
 3 3 3 1 3 1 1 1 3 3 1 1 3 1 1 1 1 1 1]
test:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[CV 1/2; 2/2] END .............n_estimators=100;, score=0.412 total time=   0.0s
[CV 2/2; 2/2] START n_estimators=100............................................
train:  [3 3 3 3 3 0 0 3 3 3 3 3 3 0 3 0 0 0 3 3 0 0 0 3 3 0 0 3 3 3 3 3 3 3 3 3 0
 0 0 0 3 3 3 0 0 3]
test:  [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
[CV 2/2; 2/2] END .............n_estimators=100;, score=0.333 total time=   0.0s
train:  [1 3 3 3 1 3 3 1 1 1 0 1 1 0 3 1 1 1 3 3 1 3 3 3 0 3 1 1 0 1 0 0 3 3 0 0 0
 3 3 0 0 3 3 3 3 3 1 3 3 3 3 0 0 1 0 0 3 1 1 1 3 3 1 1 0 0 3 1 1 1 1 1 1]
test:  [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
[CV] END ................................ score: (test=0.481) total time=   0.0s
[CV] START .....................................................................
Fitting 2 folds for each of 2 candidates, totalling 4 fits
[CV 1/2; 1/2] START n_estimators=50.............................................
train:  [3 3 3 3 2 3 2 2 2 3 2 3 2 3 2 2 2 3 3 2 3 3 2 2 2 3 3 2 2 3 3 2 2 2 3 3 3
 3 3 3 3 2 3 3 2 2 3 2 2 2 2 3 3 2 3 2]
test:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[CV 1/2; 1/2] END ..............n_estimators=50;, score=0.588 total time=   0.0s
[CV 2/2; 1/2] START n_estimators=50.............................................
train:  [3 3 3 3 2 3 2 2 2 3 2 3 2 3 2 2 2 3 3 2 3 3 2 2 2 3 3 2 2 3 3 2 2 2 3 3 3
 3 3 3 3 2 3 3 2 2 3 2 2 2 2 3 3 2 3 2]
test:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[CV 2/2; 1/2] END ..............n_estimators=50;, score=0.588 total time=   0.0s
[CV 1/2; 2/2] START n_estimators=100............................................
train:  [3 3 3 3 2 3 2 2 2 3 2 3 2 3 2 2 2 3 3 2 3 3 2 2 2 3 3 2 2 3 3 2 2 2 3 3 3
 3 3 3 3 2 3 3 2 2 3 2 2 2 2 3 3 2 3 2]
test:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[CV 1/2; 2/2] END .............n_estimators=100;, score=0.647 total time=   0.0s
[CV 2/2; 2/2] START n_estimators=100............................................
train:  [3 3 3 3 2 3 2 2 2 3 2 3 2 3 2 2 2 3 3 2 3 3 2 2 2 3 3 2 2 3 3 2 2 2 3 3 3
 3 3 3 3 2 3 3 2 2 3 2 2 2 2 3 3 2 3 2]
test:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[CV 2/2; 2/2] END .............n_estimators=100;, score=0.471 total time=   0.0s
train:  [3 3 3 3 2 3 2 2 0 2 0 3 2 3 2 3 2 2 2 3 3 2 3 0 3 2 2 2 0 0 0 3 3 2 0 0 2
 0 3 3 0 0 2 2 2 3 3 3 3 3 3 3 2 3 3 2 2 0 0 0 0 3 2 2 2 2 3 3 2 0 0 3 2]
test:  [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
[CV] END ................................ score: (test=0.593) total time=   0.0s
[0.48148148 0.59259259]
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    0.0s
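
Instead of reading the printed group labels by eye, the same property can be asserted in code. The sketch below does not exercise metadata routing; it only calls the two splitters directly, the way the nested procedure is expected to, and checks that no group is shared between train and test at either level. The seeds and the groups_overlap helper are assumptions introduced for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.RandomState(0)
X = rng.rand(100, 10)
y = rng.randint(2, size=100)
groups = rng.randint(4, size=100)  # Example group labels

outer_cv = GroupShuffleSplit(n_splits=2, test_size=0.25, random_state=0)
inner_cv = GroupShuffleSplit(n_splits=2, test_size=0.33, random_state=0)


def groups_overlap(a, b):
    # Hypothetical helper: True if any group label occurs in both arrays.
    return bool(np.intersect1d(a, b).size)


checks = []
for outer_train, outer_test in outer_cv.split(X, y, groups=groups):
    checks.append(groups_overlap(groups[outer_train], groups[outer_test]))
    inner_groups = groups[outer_train]  # labels aligned with X[outer_train]
    for inner_train, inner_test in inner_cv.split(
        X[outer_train], y[outer_train], groups=inner_groups
    ):
        # Inner train and inner test must not share a group ...
        checks.append(
            groups_overlap(inner_groups[inner_train], inner_groups[inner_test])
        )
        # ... and the outer test group must never reach the inner loop.
        checks.append(groups_overlap(inner_groups[inner_test], groups[outer_test]))

print("any leakage:", any(checks))
```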
