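I am doing text classification with two different datasets: the goal is to train on one and test on the other. Note that I don't want to merge the datasets, in order to prevent leakage (I think that's what it's called). The test dataset (~1000 rows) is much smaller than the training dataset (16k rows).
I am using CountVectorizer, and because the two datasets have different vocabularies, it produces a different number of columns for each one, which causes an error at the prediction step.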
ValueError: X has 55229 features, but DecisionTreeClassifier is expecting 387964
features as input.
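To illustrate where I think the mismatch comes from, here is a minimal sketch on toy data. The corpora and variable names are made up for illustration and are not my real datasets. Each CountVectorizer builds its vocabulary from the texts it is fitted on, so two separately fitted vectorizers give matrices of different widths.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora (hypothetical), standing in for the two datasets
train_texts = ["the cat sat on the mat", "dogs chase cats"]
test_texts = ["a bird sings"]

# Fitting separately means each vectorizer learns its own vocabulary
cv_train = CountVectorizer().fit(train_texts)
cv_test = CountVectorizer().fit(test_texts)

n_train_features = cv_train.transform(train_texts).shape[1]
n_test_features = cv_test.transform(test_texts).shape[1]

# The column counts differ because the vocabularies differ
print(n_train_features, n_test_features)
```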
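I have been searching GPT and Google for a while, and the guidance I have found is mixed:
- Add zero-filled columns to the smaller x_test
- Use a scikit-learn Pipeline
Here is the code snippet: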
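import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# read dfs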
df_1 = pd.read_csv("data1.csv",header=0) # for training, has text, and class columns
df_2 = pd.read_csv("data2.csv",header=0) # for testing, has text, and class columns
# vectorise
CV1 = CountVectorizer(ngram_range=(1,3), stop_words="english").fit(df_1['text'])
x_train = CV1.transform(df_1['text'])
y_train = df_1['class']
CV2 = CountVectorizer(ngram_range=(1,3), stop_words="english").fit(df_2['text'])
x_test = CV2.transform(df_2['text'])
y_test = df_2['class']
## shapes of objects
## x_test (1589, 55229), y_test(1589,)
## x_train (16716, 387964), y_train(16716,)
# build classifier and predict
classifier = DecisionTreeClassifier(random_state=1234)
model = classifier.fit(x_train,y_train)
y_pred = model.predict(x_test)
# error ValueError: X has 55229 features, but DecisionTreeClassifier is expecting 387964 features as input.
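For reference, this is how I understand the Pipeline suggestion: a minimal sketch on toy data, assuming the vectorizer should be fitted on the training texts only and then reused to transform the test texts. The texts, labels, and the `pipe` name are hypothetical and stand in for df_1 and df_2. I am not sure this is the recommended approach.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical), standing in for df_1 (train) and df_2 (test)
train_texts = ["wonderful movie", "wonderful film", "terrible movie", "terrible film"]
train_labels = ["pos", "pos", "neg", "neg"]
test_texts = ["wonderful unseen movie", "terrible plot"]

# One vectorizer inside the pipeline: its vocabulary is learned from
# the training texts only, so no test data is used during fitting
pipe = make_pipeline(
    CountVectorizer(ngram_range=(1, 3), stop_words="english"),
    DecisionTreeClassifier(random_state=1234),
)
pipe.fit(train_texts, train_labels)

# predict() transforms the test texts with the already-fitted vocabulary;
# n-grams not seen in training are dropped, so the column count matches
y_pred = pipe.predict(test_texts)

vectorizer = pipe.named_steps["countvectorizer"]
n_features = len(vectorizer.vocabulary_)
x_test = vectorizer.transform(test_texts)
print(x_test.shape, n_features)
```

If this is right, the test matrix always has the same number of columns as the training matrix, and the two datasets are never merged.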