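I am trying to get the counts of the most frequent words in a DataFrame, grouped by the values of another column.
I have a DataFrame like this: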
import pandas as pd

df = pd.DataFrame({
    'Category': ['Red', 'Red', 'Blue', 'Yellow', 'Blue'],
    'Text': ['this is very good ', 'good', 'dont like', 'stop', 'dont like'],
})
This is how I count the keywords in the Text column:
from collections import Counter

import nltk  # requires the stopwords corpus: nltk.download('stopwords')

top_N = 100
stopwords = nltk.corpus.stopwords.words('english')
# regex matching any stopword as a whole word
RE_stopwords = r'\b(?:{})\b'.format('|'.join(stopwords))
# replace '|' with ' ' and drop all stopwords
words = (df.Text
           .str.lower()
           .replace([r'\|', RE_stopwords], [' ', ''], regex=True)
           .str.cat(sep=' ')
           .split()
)
# build a DataFrame from the Counter
df_top_words = pd.DataFrame(Counter(words).most_common(top_N),
                            columns=['Word', 'Frequency']).set_index('Word')
print(df_top_words)
This produces the following result:
However, this only produces a list of all the words in the DataFrame. What I want is the following:
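Assuming the goal is the most frequent words within each Category (the exact desired layout is an assumption), one possible approach is to split each text into words, explode to one row per word, and then count per group. This is only a minimal sketch: the small `stopwords` set is a stand-in so the example runs without the NLTK corpus, and `top_n` is a hypothetical per-category cutoff.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Red', 'Red', 'Blue', 'Yellow', 'Blue'],
    'Text': ['this is very good ', 'good', 'dont like', 'stop', 'dont like'],
})

# Stand-in stopword list (assumption) so the sketch runs without NLTK data
stopwords = {'this', 'is', 'very'}
top_n = 2  # hypothetical cutoff: keep at most this many words per category

# One row per (Category, Word): lowercase, split on whitespace, then explode
words = (
    df.assign(Word=df['Text'].str.lower().str.split())
      .explode('Word')
      .dropna(subset=['Word'])
)
words = words[~words['Word'].isin(stopwords)]

# Count each word within its category, most frequent first
counts = (
    words.groupby(['Category', 'Word'])
         .size()
         .reset_index(name='Frequency')
         .sort_values(['Category', 'Frequency'], ascending=[True, False])
)

# Keep the top_n rows of each category
top_words = counts.groupby('Category').head(top_n).reset_index(drop=True)
print(top_words)
```

To use the real stopword list, `stopwords` could be replaced with `set(nltk.corpus.stopwords.words('english'))`. Ties within a category are kept in whatever order the sort leaves them, so a secondary sort key may be needed if a deterministic order matters.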