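Is it possible to change the threshold of a decision tree classifier? I am studying the trade-off between precision and recall and want to change the threshold to favor recall. I am studying ML, but the material I am following uses an SGDClassifier, and at one point it calls cross_val_predict() with method="decision_function", which does not exist for DecisionTreeClassifier. I am using a pipeline and cross-validation. My study uses this dataset: https://www.kaggle.com/datasets/imnikhilanand/heart-attack-prediction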

features = df_heart.drop(['output'], axis=1).copy()
labels = df_heart.output

#split
X_train, X_test, y_train, y_test= train_test_split(features, labels,
                                train_size=0.7,
                                random_state=42,
                                stratify=features["sex"]
                               )
# categorical features
cat = ['sex', 'tipo_de_dor', 'ang_indz_exerc', 'num_vasos', 'acuc_sang_jejum', 'eletrc_desc', 'pico_ST_exerc', 'talassemia']

# treatment of categorical variables
t = [('cat', OneHotEncoder(handle_unknown='ignore'), cat)]

preprocessor = ColumnTransformer(transformers=t, remainder='passthrough')

#pipeline
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('clf', DecisionTreeClassifier(min_samples_leaf=8, random_state=42),)
                       ]
                )

pipe.fit(X_train, y_train)

valid_cruz_strat = StratifiedKFold(n_splits=14, shuffle=True, random_state=42)

y_train_pred = cross_val_predict(pipe['clf'], X_train, y_train, cv=valid_cruz_strat)

conf_mat = confusion_matrix(y_train, y_train_pred)


ConfusionMatrixDisplay(confusion_matrix=conf_mat, 
                       display_labels=pipe['clf'].classes_).plot()
plt.grid(False)
plt.show()

confusion matrix

threshold = 0 #this is only for support the graph
idx = (thresholds >= threshold).argmax()  # first index ≥ threshold

plt.plot(thresholds, precisions[:-1], 'b--', label = 'Precisão')
plt.plot(thresholds, recalls[:-1], 'g-', label = 'Recall')
plt.vlines(threshold, 0, 1.0, "k", "dotted", label="threshold")
plt.title('Precisão x Recall', fontsize = 14)

plt.plot(thresholds[idx], precisions[idx], "bo")
plt.plot(thresholds[idx], recalls[idx], "go")
plt.axis([-.5, 1.5, 0, 1.1])
plt.grid()
plt.xlabel("Threshold")
plt.legend(loc="lower left")

plt.show()

Precision/Recall - threshold

valid_cruz_strat = StratifiedKFold(n_splits=14, shuffle=True, random_state=42)

y_score = cross_val_predict(pipe['clf'], X_train, y_train, cv=valid_cruz_strat)

precisions, recalls, thresholds = precision_recall_curve(y_train, y_score)

threshold = 0.75 #this is only for support the graph
idx = (thresholds >= threshold).argmax() 

plt.figure(figsize=(6, 5))  

plt.plot(recalls, precisions, linewidth=2, label="Precision/Recall curve")

plt.plot([recalls[idx], recalls[idx]], [0., precisions[idx]], "k:")
plt.plot([0.0, recalls[idx]], [precisions[idx], precisions[idx]], "k:")
plt.plot([recalls[idx]], [precisions[idx]], "ko",
         label="Point at threshold "+str(threshold))

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.axis([0, 1, 0, 1])
plt.grid()
plt.legend(loc="lower left")

plt.show()

Precision x Recall

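When I inspect the arrays generated by precision_recall_curve(), I find that they contain only 3 elements. Is this the correct behavior? For example, when I run cross_val_predict() on the SGDClassifier as in the book, but without method="decision_function", and pass the output to precision_recall_curve(), it generates arrays with 3 elements. If I use method="decision_function", it generates arrays with many elements.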

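My main question is how to select the threshold for a DecisionTreeClassifier, and whether there is a way to generate a Precision x Recall curve with more points. With only these three points to work with, I cannot see how to improve recall.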

Moving the threshold to improve recall, and understanding how to do this with a decision tree classifier

Recommended answer

This topic is often referred to as "model calibration". scikit-learn supports several types of probability calibration, which may also be useful reading.
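As a rough illustration of the probability calibration mentioned above, the sketch below wraps a decision tree in scikit-learn's CalibratedClassifierCV. The synthetic data, the tree settings, and the choice of method="sigmoid" with 5 folds are assumptions made only for this example, not something taken from the question.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: the real heart dataset is not used here).
X, y = make_classification(n_samples=2000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Wrap the tree so its scores are recalibrated with internal cross-validation.
base_tree = DecisionTreeClassifier(max_depth=5, random_state=42)
calibrated = CalibratedClassifierCV(base_tree, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

# Calibrated probability of the positive class for each test sample.
calibrated_proba = calibrated.predict_proba(X_test)[:, 1]
print(calibrated_proba[:5])
```

Calibration changes how the scores should be read as probabilities. Any threshold applied to them still has to be chosen separately.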

One way to "change the threshold" of a DecisionTreeClassifier is to call .predict_proba(X) and observe one or more metrics across the possible thresholds:

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
import numpy as np
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=10000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(max_depth=5)
clf.fit(X_train, y_train)

prob_pred = clf.predict_proba(X_test)[:, 1]

thresholds = np.arange(0.0, 1.0, step=0.01)
recall_scores = [recall_score(y_test, prob_pred > t) for t in thresholds]
precis_scores = [precision_score(y_test, prob_pred > t) for t in thresholds]

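Now we have a set of thresholds between 0.0 and 1.0, and we have computed precision and recall at each of them. (Side note: this problem is less well defined for multi-label or multi-class predictions; there, these metrics are usually averaged per class or similar.)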

We can then plot them:

fig, ax = plt.subplots(1, 1)
ax.plot(thresholds, recall_scores, label="Recall @ t")
ax.plot(thresholds, precis_scores, label="Precision @ t")
ax.axvline(0.5, c="gray", linestyle="--", label="Default Threshold")
ax.set_xlabel("Threshold")
ax.set_ylabel("Metric @ Threshold")
ax.set_box_aspect(1)
ax.legend()
plt.show()

The result looks like this:

A line plot showing thresholds from 0 to 1 on the x axis and recall on the y axis. The line is parabolic, with high recall for low thresholds and low recall for high thresholds.

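This shows that the default threshold of 0.5 used by .predict() may not be the best choice in every case. In fact, there is a range of thresholds over which precision and recall are fairly close, while each threshold still favors one or the other. In this case, lowering the threshold slightly will favor recall, while raising it will favor precision.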
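To make "lowering the threshold to favor recall" concrete, here is a minimal sketch that picks the highest threshold still reaching a target recall and then applies it by hand. The value of target_recall, the synthetic data, and the names X_val and y_val are assumptions for illustration. In real use the threshold would be chosen on validation data and only then checked on a separate test set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; X_val / y_val play the role of a validation split (assumption).
X, y = make_classification(n_samples=10000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(max_depth=5, random_state=42)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_val)[:, 1]

# precisions[i] and recalls[i] describe predictions made with proba >= thresholds[i].
precisions, recalls, thresholds = precision_recall_curve(y_val, proba)

target_recall = 0.95  # hypothetical requirement, not taken from the question
# recalls has one more entry than thresholds, so drop the last one to align them.
meets_target = recalls[:-1] >= target_recall
chosen_threshold = thresholds[meets_target].max()

# Apply the custom threshold instead of calling clf.predict().
y_pred = (proba >= chosen_threshold).astype(int)
val_recall = recall_score(y_val, y_pred)
val_precision = precision_score(y_val, y_pred)
print(chosen_threshold, val_recall, val_precision)
```

Taking the largest qualifying threshold keeps precision as high as the recall target allows, since recall only falls as the threshold rises.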

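In practice, the appropriate threshold for a problem comes down to domain knowledge, because there is always a trade-off between precision and recall.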
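On the three-element arrays seen in the question: cross_val_predict() returns hard class labels by default, so precision_recall_curve() only sees the two values 0 and 1. The sketch below, on synthetic data with an assumed 5-fold split, contrasts that with method="predict_proba", which plays the role that method="decision_function" plays for an SGDClassifier.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the heart dataset from the question).
X, y = make_classification(n_samples=2000, random_state=42)
tree = DecisionTreeClassifier(min_samples_leaf=8, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Default behaviour: out-of-fold hard labels (0 or 1), so only two thresholds exist.
y_labels = cross_val_predict(tree, X, y, cv=cv)
p_lab, r_lab, t_lab = precision_recall_curve(y, y_labels)

# Out-of-fold probabilities of the positive class give many distinct thresholds.
y_proba = cross_val_predict(tree, X, y, cv=cv, method="predict_proba")[:, 1]
p_pro, r_pro, t_pro = precision_recall_curve(y, y_proba)

print(len(p_lab), len(p_pro))  # the first is 3, as observed in the question
```

In the question's setup, passing the whole `pipe` to cross_val_predict() rather than `pipe['clf']` would also refit the preprocessing inside each fold. A tree still produces a limited number of distinct probabilities (one per leaf), so the curve stays step-like, but it has far more than three points.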
