I created a dictionary of regression models from the training dataset `d`, keyed by the values of `group`:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
d = pd.DataFrame({
"group":["cat","fish","horse","cat","fish","horse","cat","horse"],
"x":[1,4,7,2,5,8,3,9],
"y":[10,20,14,12,12,3,12,2],
"z":[3,5,3,5,9,1,2,3]
})
features, models = ['x', 'z'], {}
for animal in ['horse', 'cat', 'fish']:
    # One single-step pipeline per group, fitted only on that group's rows
    models[animal] = Pipeline([("estimator", LinearRegression(fit_intercept=True))])
    x, y = d.loc[d.group == animal, features], d.loc[d.group == animal, "y"]
    models[animal].fit(x, y)
I also have a test dataset `test_d`, which has rows for some of the groups, but not for all of the groups (i.e. not for all of the models):
test_d = pd.DataFrame({
"group":["dog","fish","horse","dog","fish","horse","dog","horse"],
"x":[1,2,3,4,5,6,7,8],
"z":[3,5,3,5,9,1,2,3]
})
I want to use `apply` on the grouped `test_d`, using `.name` to look up the correct model (if one exists), and return the predictions from a function `f()`:
def f(g):
    try:
        predictions = models[g.name].predict(g[features])
    except KeyError:
        # No model was trained for this group
        predictions = [None] * len(g)
    return predictions
The function "works" in the sense that it returns the correct values:
grouping_column ="group"
test_d.groupby(grouping_column, group_keys=False).apply(f)
Output:
group
dog                           [None, None, None]
fish     [20.94117647058824, 12.000000000000004]
horse                          [38.0, 15.0, 8.0]
dtype: object
Question:
How should `f()` be written so that I can assign the values directly to `test_d`? I would like to do this:
test_d["predictions"] = test_d.groupby(grouping_column, group_keys=False).apply(f)
But this obviously doesn't work; every prediction comes out as NaN:
   group  x  z  predictions
0    dog  1  3          NaN
1   fish  2  5          NaN
2  horse  3  3          NaN
3    dog  4  5          NaN
4   fish  5  9          NaN
5  horse  6  1          NaN
6    dog  7  2          NaN
7  horse  8  3          NaN
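A small illustration of why the column ends up all NaN, as far as I can tell: assigning a Series to a DataFrame column aligns on index labels, and the result of `apply(f)` is indexed by group name while `test_d` is indexed by row number. The toy frame `df` and the Series `per_group` below are hypothetical stand-ins, not the data from the question.

```python
import pandas as pd

# Toy frame with the default integer row index 0, 1, 2
df = pd.DataFrame({"group": ["dog", "fish", "dog"]})

# Stand-in for the apply(f) result: one list per group, indexed by group name
per_group = pd.Series({"dog": [None, None], "fish": [1.0]})

# Column assignment aligns on index labels; 0, 1, 2 never match "dog"/"fish",
# so every row receives a missing value
df["predictions"] = per_group

print(df)
```

So the values themselves are fine; they are just attached to the wrong labels.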
Expected result:
   group  x  z  predictions
0    dog  1  3          NaN
1   fish  2  5    20.941176
2  horse  3  3    38.000000
3    dog  4  5          NaN
4   fish  5  9    12.000000
5  horse  6  1    15.000000
6    dog  7  2          NaN
7  horse  8  3     8.000000