I want to use several feature selection methods in an sklearn pipeline, as follows:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = Pipeline([('vt', VarianceThreshold(0.01)),
                  ('kbest', SelectKBest(chi2, k=5)),
                  ('gbc', GradientBoostingClassifier(random_state=0))])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
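I want to get the names or column indices of the selected features. The catch is that the second feature selection step receives the output of the first feature selection step, not the original X_train. So when I call methods such as get_support() or get_feature_names_out() on the second step, the feature names or indices it returns do not match the original input features.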
vt = model['vt']
vt.get_feature_names_out()
vt.get_support()
kbest = model['kbest']
kbest.get_feature_names_out()
kbest.get_support()
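For example, vt.get_support() returns a boolean array with 30 entries, but kbest.get_support() returns a boolean array with only 14 entries. This means the names or column indices of the data fed into the second feature selection step are reset, and they no longer line up with those of the data fed into the first step.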
How can I solve this?