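My understanding of Recursive Feature Elimination with Cross-Validation (sklearn.feature_selection.RFECV): you provide an algorithm, which is trained on the entire dataset and creates a feature importance ranking using the attribute coef_ or feature_importances_. Now, with all features included, the algorithm is evaluated by cross-validation. Then the bottom-ranked feature is removed, the model is retrained on the dataset to create a new ranking, and it is again evaluated by cross-validation. This continues until only one feature remains (or the number specified by min_features_to_select), and the final number of features selected is the one that yields the highest CV score. (Source)

Question: The CV score for each number of features is stored in rfecv.cv_results_["mean_test_score"], and I have been having trouble replicating these scores without using scikit-learn's built-in method.

This is what I tried in order to obtain the score for n-1 features, where n is the total number of features.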

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate
from sklearn.feature_selection import RFECV

alg = DecisionTreeClassifier(random_state = 0)
cv_split = StratifiedKFold(5)
# train is a pandas dataframe, x_var and y_var are both lists containing variable strings
X = train[x_var]
y = np.ravel(train[y_var])

alg.fit(X, y)
lowest_ranked_feature = np.argmin(alg.feature_importances_)
x_var.pop(lowest_ranked_feature)

one_removed_feature = train[x_var]
alg.fit(one_removed_feature, y)
cv_score = cross_validate(alg, one_removed_feature, y, cv=cv_split, scoring="accuracy")
np.mean(cv_score["test_score"])

This is the built-in method, which gives a different score:

rfecv = RFECV(
    estimator=alg,
    step=1,
    cv=cv_split,
    scoring="accuracy",
)

rfecv.fit(X, y)
rfecv.cv_results_["mean_test_score"][-2]

How do I get the exact scores as calculated in the inbuilt method?

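I would also like to point out that I first tried this with all n features, and my approach matched rfecv.cv_results_["mean_test_score"][-1].

Recommended answer

As Ben pointed out, the reason you are getting different answers is that your understanding of RFECV is fundamentally flawed. Cross-validation is not carried out at each step of RFE; rather, RFE is carried out within each fold of the cross-validation.

Your approach currently removes a feature from the data and then performs CV to score it, which essentially performs a separate CV to score each feature subset. RFECV, however, splits the data for CV only once at the start, and then loops between removing features and scoring the model within each individual fold.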
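To make the "RFE inside each fold" structure concrete, here is a minimal sketch that runs scikit-learn's RFE separately on the training part of each fold and prints the resulting ranking. The synthetic dataset from make_classification and the variable names are assumptions for illustration, not the asker's data. The elimination order can differ from fold to fold, which is something a single elimination pass on the full dataset cannot reproduce.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; replace with your own X and y)
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

cv_split = StratifiedKFold(5)
fold_rankings = []

# The data is split once; RFE then runs on the training part of every fold
for train_index, test_index in cv_split.split(X, y):
    rfe = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=1, step=1)
    rfe.fit(X[train_index], y[train_index])
    # ranking_ is 1 for the last surviving feature and larger for features removed earlier
    fold_rankings.append(rfe.ranking_)

for fold_index, ranking in enumerate(fold_rankings):
    print(f"fold {fold_index}: ranking = {ranking}")
```

If the printed rankings are not identical across folds, the feature dropped at a given step depends on the fold, so no single "n-1 feature" subset exists to score.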

Note: From experimentation, I have found that sklearn's RFECV computes the score on a single fold directly, using whichever evaluation metric you specify, rather than through another layer of cross-validation. In your case, you have chosen accuracy. So within a fold, the model scores the test split based on accuracy, and repeats this for each subset of features.

You can implement RFECV from scratch as follows:

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def score(alg, x_train, y_train, x_test, y_test):
    """Calculate the accuracy score of the algorithm."""
    alg.fit(x_train, y_train)
    y_pred = alg.predict(x_test)
    return accuracy_score(y_test, y_pred)

def rfecv_function(alg, x_var, X, y, cv):
    """Perform RFECV and return a dictionary to store an array of test scores
    for each cv, where the array contains test scores for each number of 
    features selected 1,2,...,n (n is total number of features).
    """
    dic = {}
    # Iterate through folds
    for fold_index, (train_index, test_index) in enumerate(cv.split(X, y)):
        x_train_fold, y_train_fold = X.iloc[train_index], y.iloc[train_index]
        x_test_fold, y_test_fold = X.iloc[test_index], y.iloc[test_index]
        features = x_var.copy()
        
        # Array to store test scores for each feature subset
        scores_array = np.empty(len(x_var))
        
        # Iterate through the feature subsets
        for i in range(len(x_var)):
            # Calculate and store the scores in the array
            scores = score(alg, x_train_fold[features], y_train_fold, 
                           x_test_fold[features], y_test_fold)
            scores_array[-i-1] = scores 
            # Find and remove the lowest ranked feature
            alg.fit(x_train_fold[features], y_train_fold)
            lowest_rank = features[np.argmin(alg.feature_importances_)]
            features.remove(lowest_rank)
            
        dic[f"split{fold_index}_test_score"] = scores_array
    
    return dic

dtree = DecisionTreeClassifier(random_state = 0)
cv_split = StratifiedKFold(5, shuffle=True, random_state=0)

# Assume train is a pandas dataframe
x_var = ["var1", "var2", "var3"]
y_var = ["target_var"]
rfecv_scores = rfecv_function(dtree, x_var, train[x_var], train[y_var], cv_split)

rfecv_scores gives the same values as those computed by the built-in class in RFECV.cv_results_.
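To compare against rfecv.cv_results_["mean_test_score"], the per-fold arrays still need to be averaged across folds. The sketch below uses hypothetical score values, made up purely to show the shapes, in the same dictionary layout that rfecv_function returns.

```python
import numpy as np

# Hypothetical per-fold scores for 3 features (illustrative numbers only),
# in the layout returned by rfecv_function above
rfecv_scores = {
    "split0_test_score": np.array([0.70, 0.80, 0.90]),
    "split1_test_score": np.array([0.60, 0.80, 0.70]),
}

# Stack folds as rows, then average column-wise: one mean per number of features
mean_test_score = np.mean(np.vstack(list(rfecv_scores.values())), axis=0)

# Index i holds the score for i + 1 features, so [-2] is the n - 1 feature score
print(mean_test_score)
print(mean_test_score[-2])
```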
