I have a problem: I want to use a pipeline (one-hot encoding as preprocessing and a simple linear regression as the model) together with SHAP.

Here is my data (I'm using a version of the bike sharing dataset that I modified):

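import pandas as pd
import shap
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

bike_data = pd.read_csv("bike_outlier_clean.csv")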

bike_data['season']=bike_data.season.astype('category')
bike_data['year']=bike_data.year.astype('category')
bike_data['holiday']=bike_data.holiday.astype('category')
bike_data['workingday']=bike_data.workingday.astype('category')
bike_data['weather_condition']=bike_data.weather_condition.astype('category')

bike_data['season'] = bike_data['season'].map({1:'Spring', 2:'Summer', 3:'Fall', 4: 'Winter'})
bike_data['year'] = bike_data['year'].map({0: 2011, 1: 2012})
bike_data['holiday'] = bike_data['holiday'].map({0: False, 1: True})
bike_data['workingday'] = bike_data['workingday'].map({0: False, 1: True})
bike_data['weather_condition'] = bike_data['weather_condition'].map({1:'Clear', 2:'Mist', 3:'Light Snow/Rain', 4: 'Heavy Snow/Rain'})

bike_data = bike_data[['total_count','season','month','year','weekday','holiday','workingday','weather_condition','humidity','temp','windspeed']]

x = bike_data.drop('total_count', axis=1)
y = bike_data['total_count']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

And here is my pipeline:

category_columns = list(set(bike_data.columns) - set(bike_data._get_numeric_data().columns))
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), category_columns)
    ],
    remainder='passthrough'  
)
model = LinearRegression()

pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])

pipeline.fit(x_train,y_train)

Finally, I use the KernelSHAP explainer:

explainer = shap.KernelExplainer(pipeline.predict, shap.sample(x, 5))

However, this is where the error occurs:

    123             # Make a copy so that the feature names are not removed from the original model
    124             out = copy.deepcopy(out)
--> 125             out.f.__self__.feature_names_in_ = None
    126 
    127     return out

AttributeError: can't set attribute 'feature_names_in_'

I'm completely at a loss as to how to fix this.

Recommended answer

SHAP does not play well with Pipeline objects, so I'd suggest the following (note the point where I switch from a pandas DataFrame to NumPy arrays):

import numpy as np
import pandas as pd
import shap

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

print(shap.__version__)

bike_data = pd.read_csv("archive/bike_sharing_daily.csv")
bike_data['season']=bike_data.season.astype('category')
bike_data['holiday']=bike_data.holiday.astype('category')
bike_data['workingday']=bike_data.workingday.astype('category')
bike_data['season'] = bike_data['season'].map({1:'Spring', 2:'Summer', 3:'Fall', 4: 'Winter'})
bike_data['holiday'] = bike_data['holiday'].map({0: False, 1: True})
bike_data['workingday'] = bike_data['workingday'].map({0: False, 1: True})
bike_data = bike_data[['season','weekday','holiday','workingday','temp','windspeed']]

x = bike_data

y = np.random.randint(0, 10, len(bike_data))  # random placeholder target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

category_columns = list(set(bike_data.columns) - set(bike_data._get_numeric_data().columns))

col_idx = [i for i, col in enumerate(x_train.columns) if col in category_columns]

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), col_idx)
    ],
    remainder='passthrough'  
)
model = LinearRegression()

pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])

pipeline.fit(x_train.values, y_train) # <-- NumPy arrays from here on

explainer = shap.KernelExplainer(pipeline.predict, x_train.values[:10])
sv = explainer.shap_values(x_train.values)

shap.summary_plot(sv, x.columns) # <-- add column names back

Printed shap version: 0.44.1.dev4
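
For context on the AttributeError in the question: judging by the traceback, shap tries to assign `feature_names_in_ = None` on the object that owns the `predict` method it was given, and as far as I know scikit-learn's Pipeline exposes `feature_names_in_` as a read-only property. The sketch below reproduces just that mechanism in plain Python. `ToyPipeline` is a hypothetical stand-in I made up for illustration, not the real scikit-learn class.

```python
class ToyPipeline:
    """Hypothetical stand-in for an object with a read-only attribute."""

    def __init__(self, names):
        self._names = list(names)

    @property
    def feature_names_in_(self):
        # Getter only: no setter is defined, so assignment is rejected.
        return self._names

    def predict(self, rows):
        # Dummy prediction, one value per input row.
        return [0.0 for _ in rows]


toy = ToyPipeline(["season", "temp"])
bound = toy.predict  # a bound method keeps a reference to its owner

try:
    # Same shape of assignment as the line flagged in the traceback.
    bound.__self__.feature_names_in_ = None
    assignment_failed = False
except AttributeError:
    assignment_failed = True

print(assignment_failed)  # True
```

Fitting on `x_train.values` avoids this because a pipeline fitted on a plain array has no feature names to reset in the first place.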

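If you would rather keep fitting the pipeline on the DataFrame, another option that should work is to give KernelExplainer a plain function instead of the bound method `pipeline.predict`, and rebuild the DataFrame inside that function. This is only a sketch under assumptions: the data is synthetic (not the bike dataset), `predict_from_array` is a helper name I introduced, and I have not checked it against the shap version printed above, so the shap calls are left as comments.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (assumption: not the dataset from the question).
rng = np.random.default_rng(0)
n = 40
x = pd.DataFrame({
    "season": np.tile(["Spring", "Summer", "Fall", "Winter"], n // 4),
    "temp": rng.random(n),
    "windspeed": rng.random(n),
})
y = rng.random(n)

pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(
        transformers=[("cat", OneHotEncoder(), ["season"])],
        remainder="passthrough",
    )),
    ("model", LinearRegression()),
])
pipeline.fit(x, y)  # fitted on the DataFrame, column names intact

feature_names = list(x.columns)
dtypes = x.dtypes.to_dict()


def predict_from_array(arr):
    """Rebuild a DataFrame from a raw 2-D array, then call the pipeline."""
    frame = pd.DataFrame(arr, columns=feature_names).astype(dtypes)
    return pipeline.predict(frame)


# The wrapper should agree with calling the pipeline on the DataFrame.
same = np.allclose(predict_from_array(x.values), pipeline.predict(x))
print(same)

# With shap installed, the wrapper could then replace pipeline.predict:
# explainer = shap.KernelExplainer(predict_from_array, x.values[:10])
# sv = explainer.shap_values(x.values)
```

The trade-off is an extra DataFrame construction on every call, which adds up because KernelExplainer evaluates the model many times.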
