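I'm confused by sklearn's permutation_importance function. I fitted a pipeline with a regularized logistic regression, which left several feature coefficients at exactly 0. However, when I compute the permutation importance of the features on the test dataset, some of these features get non-zero importance values.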

Here is some example code and data:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
import scipy.stats as stats
from scipy.stats import loguniform  # available in SciPy >= 1.4; replaces the old sklearn.utils.fixes backport
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance


# create example data with missing values
X, y = make_classification(n_samples = 500,
                           n_features = 100,
                           n_informative = 25,
                           n_redundant = 75,
                           random_state = 0)
c = 10000  # number of missing values to introduce
X.ravel()[np.random.choice(X.size, c, replace = False)] = np.nan  # set c randomly chosen cells to NaN (unseeded, so this varies between runs)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = 0.2, random_state = 0)

folds = 5
repeats = 5
n_iter = 25
rskfold = RepeatedStratifiedKFold(n_splits = folds, n_repeats = repeats, random_state = 1897)

scl = StandardScaler()
imp = KNNImputer(n_neighbors = 5, weights = 'uniform')
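# 'log_loss' is the current name of the logistic loss (called 'log' in older scikit-learn versions)
sgdc = SGDClassifier(loss = 'log_loss', penalty = 'elasticnet', class_weight = 'balanced', random_state = 0)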

pipe = Pipeline([('scaler', scl),
                 ('imputer', imp),
                 ('clf', sgdc)])
param_rand = {'clf__l1_ratio': stats.uniform(0, 1),
              'clf__alpha': loguniform(0.001, 1)}

m = RandomizedSearchCV(pipe, param_rand, n_iter = n_iter, cv = rskfold, scoring = 'accuracy', random_state = 0, verbose = 1, n_jobs = -1)
m.fit(Xtrain, ytrain)

coefs = m.best_estimator_.named_steps['clf'].coef_
print('Number of non-zero feature coefficients in classifier:')
print(np.sum(coefs != 0))

imps = permutation_importance(m, Xtest, ytest, n_repeats = 25, random_state = 0, n_jobs = -1)

print('Number of non-zero feature importances after permutations:')
print(np.sum(imps['importances_mean'] != 0))

You will see that the second printed number does not match the first...

Thanks a lot for your help!

Recommended answer

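It's because you have a KNNImputer in the pipeline. Features whose coefficient in the model is zero still affect the imputation of other columns, so permuting them can change the predictions of the whole pipeline, and they therefore get a non-zero permutation importance.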
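
To illustrate the mechanism, here is a minimal sketch on hypothetical toy data. The arrays, the `coef` vector and the one-neighbour setting are assumptions made for illustration and are not taken from the pipeline in the question. Feature 0 has a zero coefficient, but it decides which training row `KNNImputer` copies the missing feature 1 from, so shuffling feature 0 changes the scores.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy training data: feature 1 is small when feature 0 is small, large when it is large.
X_fit = np.array([[0.0, 0.0],
                  [10.0, 100.0]])

# Test rows where feature 1 is missing and has to be imputed.
X_test = np.array([[1.0, np.nan],
                   [9.0, np.nan]])

# One neighbour keeps the example easy to follow by hand.
imputer = KNNImputer(n_neighbors=1).fit(X_fit)

# Hypothetical linear model that ignores feature 0 entirely (coefficient 0).
coef = np.array([0.0, 1.0])

# Scores on the original test data.
imputed_original = imputer.transform(X_test)
scores_original = imputed_original @ coef

# Shuffle feature 0 by swapping its two values, mimicking one permutation.
X_permuted = X_test.copy()
X_permuted[:, 0] = X_test[::-1, 0]

# Feature 0 now points each row at a different nearest neighbour,
# so the imputed feature 1 (and hence the score) changes.
imputed_permuted = imputer.transform(X_permuted)
scores_permuted = imputed_permuted @ coef

print("scores before permuting feature 0:", scores_original)
print("scores after permuting feature 0: ", scores_permuted)
```

If the goal is to measure only what the classifier itself uses, one option is to transform the test data with the preprocessing steps first (for example via `m.best_estimator_[:-1].transform(Xtest)`) and run permutation_importance on the fitted classifier alone. In that setup a feature with a zero coefficient should get exactly zero importance, because permuting it cannot change the classifier's output.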
