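I want to create categories based on specific strings found in the keyword, so that rows get a more accurate label than the default category "other".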

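For example, if "health" is found in the column, that keyword row should be labeled "health"; if "therapist" is found, it should be labeled "therapist".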

  1. Create the "category" column in code
  2. Assign the category based on a condition

I can build a table in Excel and use INDEX/MATCH, but I want to switch to Python so I can apply this to large datasets.

Below is the sample data:

keyword                                            category
HR Consultancy UK-d-uk-159_bing                    other
it support COMPANY LONDON-D-UK-G1161_bing          other
global sales training platform openings            sales
tele private practice therapist                    therapist
asset grant management system                      other
digital team project management solution openings  other
global training platform openings                  other
tele practice therapist                            therapist
global sales training platform openings            sales
tele health practice                               health
asset grant management                             other
digital team project management solution           other

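Recommended answer

You can build a single regular expression from all the keywords. Then, depending on whether you want only the first match or all matches, use str.extract, or str.extractall followed by an aggregation.

I added the keyword "private" as an example so that you can see the difference in row 3: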

import re

# df is the DataFrame from the question, with a 'keyword' column
words = ['health', 'therapist', 'sales', 'private']
# escape each word so it is matched literally, then join as alternatives
regex = '|'.join(map(re.escape, words))
# 'health|therapist|sales|private'

# option 1: get first match ((?i) makes the match case-insensitive)
df['category_first'] = (df['keyword']
    .str.extract(f'(?i)({regex})', expand=False)
    .fillna('other')
)

# option 2: get all matches, joined per original row
df['category_all'] = (df['keyword']
    .str.extractall(f'(?i)({regex})')
    [0].groupby(level=0).agg(','.join)
    .reindex(df.index, fill_value='other')
)

print(df)

Output:

                                              keyword   category category_first       category_all
0                     HR Consultancy UK-d-uk-159_bing      other          other              other
1           it support COMPANY LONDON-D-UK-G1161_bing      other          other              other
2             global sales training platform openings      sales          sales              sales
3                     tele private practice therapist  therapist        private  private,therapist
4                       asset grant management system      other          other              other
5   digital team project management solution openings      other          other              other
6                   global training platform openings      other          other              other
7                             tele practice therapist  therapist      therapist          therapist
8             global sales training platform openings      sales          sales              sales
9                                tele health practice     health         health             health
10                             asset grant management      other          other              other
11           digital team project management solution      other          other              other
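
If the category label should differ from the matched word, closer to the INDEX/MATCH lookup table mentioned in the question, one possible variation is to map the extracted word through a dictionary. The sketch below is only an illustration under assumptions: the `lookup` dictionary and its labels (for example sending "private" to "therapist") are hypothetical and not part of the original data, and the DataFrame is a shortened copy of the sample.

```python
import re
import pandas as pd

# Shortened copy of the sample data from the question
df = pd.DataFrame({
    "keyword": [
        "HR Consultancy UK-d-uk-159_bing",
        "global sales training platform openings",
        "tele private practice therapist",
        "tele health practice",
    ]
})

# Hypothetical lookup table: search term -> category label (assumption)
lookup = {
    "health": "health",
    "therapist": "therapist",
    "sales": "sales",
    "private": "therapist",
}

# Build one case-insensitive pattern from the search terms
pattern = "|".join(map(re.escape, lookup))

# First matching term per row; rows without a match become NaN
matched = df["keyword"].str.extract(f"(?i)({pattern})", expand=False)

# Normalize case, translate term -> label, and fall back to 'other'
df["category"] = matched.str.lower().map(lookup).fillna("other")

print(df)
```

As in option 1 above, the first term found in the string decides the category, so a row containing several terms is labeled by whichever one appears first in the text.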
