Given the following table:
import pandas as pd
df = pd.DataFrame({'code':['100M','60M10N40M','5S99M','1S25I100M','1D1S1I200M']})
It looks like this:
         code
0        100M
1   60M10N40M
2       5S99M
3   1S25I100M
4  1D1S1I200M
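I want to convert the strings in the code column to numbers: each number is multiplied by the weight of the letter that follows it, where M, N, and D each mean (times 1), I means (times -1), and S means (times 0). The products are then summed.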
The result should look like this:
         code  Val
0        100M  100    This is (100*1)
1   60M10N40M  110    This is (60*1)+(10*1)+(40*1)
2       5S99M   99    This is (5*0)+(99*1)
3   1S25I100M   75    This is (1*0)+(25*-1)+(100*1)
4  1D1S1I200M  200    This is (1*1)+(1*0)+(1*-1)+(200*1)
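To make the rule concrete, here is a minimal sketch of the arithmetic for a single string, written as plain Python. The names `WEIGHTS` and `code_value` are only illustrative and are not part of my actual code; the sketch also assumes every string consists solely of `<count><letter>` pairs using the five letters above.

```python
import re

# Weight of each letter, as described above (illustrative name).
WEIGHTS = {"M": 1, "N": 1, "D": 1, "I": -1, "S": 0}

def code_value(code):
    """Sum count * weight over every <count><letter> pair in the string."""
    # Assumption: the string contains only digits followed by one of M, N, D, I, S.
    pairs = re.findall(r"(\d+)([MNDIS])", code)
    return sum(int(count) * WEIGHTS[letter] for count, letter in pairs)

# Print the value for each example string from the table.
for code in ["100M", "60M10N40M", "5S99M", "1S25I100M", "1D1S1I200M"]:
    print(code, code_value(code))
```

This only spells out the expected calculation per string; it says nothing about how fast it would be on a large table.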
I wrote the following function for this:
import re

def String2Val(String):
    # Generate substrings, e.g. '60M10N40M' -> ['60M', '10N', '40M']
    sstrings = re.findall('.[^A-Z]*.', String)
    # Each letter is replaced by the arithmetic it stands for
    KeyDict = {'M':'*1','N':'*1','I':'*-1','S':'*0','D':'*1'}
    newlist = []
    for key, value in KeyDict.items():
        for i in sstrings:
            if key in i:
                # e.g. '25I' -> '25*-1', which is then evaluated
                p = i.replace(key, value)
                lp = eval(p)
                newlist.append(lp)
    OutputVal = sum(newlist)
    return OutputVal

df['Val'] = df.apply(lambda row: String2Val(row['code']), axis = 1)
After applying this function to the table, I realized that it is inefficient and takes a long time on large datasets. How can I optimize this process?