I am training a binary classifier and I want to know the AUC of its performance on the test set. I see two seemingly equivalent ways to do this: 1) pass the test set via the eval_set parameter, so that model.evals_result() reports an AUC value for each boosting round; 2) after training, predict on the test set and compute the AUC of those predictions. I expected these two approaches to produce similar values, but the second one (AUC computed from the predictions) consistently gives a much lower value. Can you help me understand what is going on? I must be misunderstanding what eval_set does.
Here is a fully reproducible example using a Kaggle dataset (available here):
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import RocCurveDisplay, roc_curve, auc
from xgboost import XGBClassifier  # xgboost version 1.7.6
import matplotlib.pyplot as plt
# Data available on kaggle here https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009/
data = pd.read_csv('winequality-red.csv')
data.head()
# Separate targets
X = data.drop('quality', axis=1)
y = data['quality'].map(lambda x: 1 if x >= 7 else 0) # wine quality >7 is good, rest is not good
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create model and fit
params = {
    'eval_metric': 'auc',
    'objective': 'binary:logistic'
}
model = XGBClassifier(**params)
model.fit(
    X_train,
    y_train,
    eval_set=[(X_test, y_test)]
)
First, I visualize the AUC metric produced by evaluating the test set supplied in eval_set:
results = model.evals_result()
auc_history = results['validation_0']['auc']  # one AUC value per boosting round
plt.plot(np.arange(len(auc_history)), auc_history)
plt.title("AUC from eval_set")
plt.xlabel("Estimator (boosting round)")
plt.ylabel("AUC")
plt.show()
Next, I predict on the same test set, compute the AUC of those predictions, and visualize the ROC curve:
test_predictions = model.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_true=y_test, y_score=test_predictions, pos_label=1)
roc_auc = auc(fpr, tpr)
display = RocCurveDisplay(roc_auc=roc_auc, fpr=fpr, tpr=tpr)
display.plot()
plt.show()
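To rule out a bug in my own ROC computation, I checked on toy data that chaining roc_curve into auc agrees with sklearn's one-step roc_auc_score (the labels and scores below are made up, purely a sanity check):

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and scores just for the comparison
y = np.array([0, 0, 1, 1, 0, 1])
s = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

fpr, tpr, _ = roc_curve(y_true=y, y_score=s, pos_label=1)
print(np.isclose(auc(fpr, tpr), roc_auc_score(y, s)))  # prints True
```

So the two routes compute the same quantity, which makes the gap against the eval_set numbers even more confusing to me.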
As you can see, the AUC of the predictions is 0.81, lower than any of the AUC values obtained via eval_set on the very same test set. What am I misunderstanding about these two approaches? Thanks; XGBoost is still new to me, and I appreciate any advice.