# Python scikit-learn: How does RFECV compute CV scores?

My understanding of recursive feature elimination with cross-validation (`sklearn.feature_selection.RFECV`): you supply an estimator that is trained on the whole dataset, and a feature-importance ranking is created from its `coef_` or `feature_importances_` attribute. The estimator is then evaluated by cross-validation with all features included. Next, the lowest-ranked feature is removed, the model is retrained on the dataset, a new ranking is created, and it is evaluated by cross-validation again. This continues until all but one feature has been eliminated (or down to `min_features_to_select`), and the final number of selected features is whichever produced the highest CV score. (Source)

Question: the CV score for each number of features is stored in `rfecv.cv_results_["mean_test_score"]`. I have been trying to replicate these scores without using scikit-learn's built-in method, and I am running into trouble.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.feature_selection import RFECV

alg = DecisionTreeClassifier(random_state=0)
cv_split = StratifiedKFold(5)
# train is a pandas DataFrame; x_var and y_var are lists of column-name strings
X = train[x_var]
y = np.ravel(train[y_var])

# Rank features on the full dataset and remove the lowest-ranked one
alg.fit(X, y)
lowest_ranked_feature = np.argmin(alg.feature_importances_)
x_var.pop(lowest_ranked_feature)

# Manual attempt: cross-validate on the reduced feature set
one_removed_feature = train[x_var]
cv_score = cross_validate(alg, one_removed_feature, y, cv=cv_split, scoring="accuracy")
np.mean(cv_score["test_score"])

# Built-in method for comparison
rfecv = RFECV(
    estimator=alg,
    step=1,
    cv=cv_split,
    scoring="accuracy",
)
rfecv.fit(X, y)
rfecv.cv_results_["mean_test_score"][-2]  # score with one feature removed
```


How do I get the exact scores as calculated in the inbuilt method?

## Recommended Answer

Note: from experimentation, I have found that sklearn's RFECV computes the score on each individual fold directly, using whichever evaluation metric you specify; there is no additional inner layer of cross-validation. In your case, that metric is accuracy. So within a fold, the model scores the test split by accuracy, and this is repeated for each subset of features.
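To see what "scoring each fold directly with the metric" means, here is a minimal sketch on synthetic data (`make_classification` stands in for your `train` dataframe; the variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

# Synthetic stand-in for your data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
alg = DecisionTreeClassifier(random_state=0)
cv = StratifiedKFold(5)

# Within each fold, the held-out split is scored directly with the
# chosen metric (accuracy here); no inner cross-validation is involved.
manual_scores = []
for train_index, test_index in cv.split(X, y):
    alg.fit(X[train_index], y[train_index])
    y_pred = alg.predict(X[test_index])
    manual_scores.append(accuracy_score(y[test_index], y_pred))
print(np.round(manual_scores, 3))  # one accuracy value per fold
```

This per-fold loop is exactly the scoring step that is repeated for each feature subset in the full implementation below.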

You can implement RFECV from scratch as follows:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def score(alg, x_train, y_train, x_test, y_test):
    """Calculate the accuracy score of the algorithm."""
    alg.fit(x_train, y_train)
    y_pred = alg.predict(x_test)
    return accuracy_score(y_test, y_pred)

def rfecv_function(alg, x_var, X, y, cv):
    """Perform RFECV and return a dictionary holding one array of test
    scores per fold, where index i of the array is the score with i+1
    features selected (up to n, the total number of features).
    """
    dic = {}
    # Iterate through folds
    for fold_index, (train_index, test_index) in enumerate(cv.split(X, y)):
        x_train_fold, y_train_fold = X.iloc[train_index], y.iloc[train_index]
        x_test_fold, y_test_fold = X.iloc[test_index], y.iloc[test_index]
        features = x_var.copy()

        # Array to store test scores for each feature subset
        scores_array = np.empty(len(x_var))

        # Iterate through the feature subsets, largest first
        for i in range(len(x_var)):
            # Calculate the score; fill from the end of the array so that
            # index 0 ends up holding the one-feature score
            scores = score(alg, x_train_fold[features], y_train_fold,
                           x_test_fold[features], y_test_fold)
            scores_array[-i - 1] = scores
            # Find and remove the lowest-ranked feature
            alg.fit(x_train_fold[features], y_train_fold)
            lowest_rank = features[np.argmin(alg.feature_importances_)]
            features.remove(lowest_rank)

        dic[f"split{fold_index}_test_score"] = scores_array

    return dic

dtree = DecisionTreeClassifier(random_state=0)
cv_split = StratifiedKFold(5, shuffle=True, random_state=0)

# Assume train is a pandas DataFrame; pass the target as a Series
x_var = ["var1", "var2", "var3"]
y_var = "target_var"
rfecv_scores = rfecv_function(dtree, x_var, train[x_var], train[y_var], cv_split)
```


`rfecv_scores` gives the same values as those computed in the built-in class's `RFECV.cv_results_`, provided the same estimator and CV splitter are passed to both.
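To recover the equivalent of `cv_results_["mean_test_score"]`, average the per-split arrays element-wise across folds. A minimal sketch (the two arrays are illustrative dummy values, shaped like the output of `rfecv_function` with index 0 = one feature kept):

```python
import numpy as np

# Per-split score arrays, as produced by rfecv_function
rfecv_scores = {
    "split0_test_score": np.array([0.70, 0.80, 0.85]),
    "split1_test_score": np.array([0.72, 0.78, 0.83]),
}

# Stack the splits and average column-wise; this mirrors how RFECV's
# cv_results_["mean_test_score"] aggregates the per-split scores
mean_test_score = np.mean(
    [rfecv_scores[k] for k in sorted(rfecv_scores)], axis=0
)
print(mean_test_score)  # one mean score per number of features
```

Then `np.argmax(mean_test_score) + 1` gives the number of features with the highest mean CV score, matching `RFECV`'s selection rule.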