Python: preventing spaCy from removing stop words inside split strings

Posted on April 19

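I am trying to use spaCy to remove stop words from a Pandas DataFrame created from a CSV. My problem is that I need to account for words that may contain a mix of letters and numbers.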

My issue:

If digits split a word so that one of the pieces is a stop word, that piece of the word gets removed.

    Ex. With stop word at the end
        Input: 'co555in'
        Breaks up the word, separating it into 'co' + 555 + 'in'
        Removes 'in' because it is a stop word.
        Output: 'co555'

    Ex. Without stop word at the end
        Input: 'co555inn'
        Breaks up the word, separating it into 'co' + 555 + 'inn'
        Will not remove 'inn' because it is not a stop word.
        Output: 'co555inn'
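
To make the two cases concrete, here is a rough imitation of that behavior in plain Python. It is only a sketch: `STOP_WORDS` is a tiny hypothetical stop list and `naive_filter` is a made-up helper that splits on letter/digit boundaries. It does not reproduce spaCy's actual tokenizer rules.

```python
import re

# Hypothetical, deliberately tiny stop list (not spaCy's real one).
STOP_WORDS = {"in", "the", "is"}

def naive_filter(word):
    """Split a word into digit runs and letter runs, drop stop-word pieces, rejoin."""
    # \d+ matches a run of digits; [^\W\d]+ matches a run of non-digit word characters.
    pieces = re.findall(r"\d+|[^\W\d]+", word)
    return "".join(p for p in pieces if p.lower() not in STOP_WORDS)

print(naive_filter("co555in"))   # co555    ('in' is dropped)
print(naive_filter("co555inn"))  # co555inn ('inn' is not a stop word)
```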

Current implementation:

    df[col] = df[col].apply(lambda text: 
            "".join(token.lemma_ for token in nlp(text) 
            if not token.is_stop))

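So what I want is to handle words that mix letters and numbers without part of the word being filtered out when the digits split the string into pieces that include a stop word.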

Recommended answer

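Edit 2: Simplified the code. Added use of Python's regular expressions library (`re`) to pull any word containing a digit out of the text before tokenizing all of the remaining text. Also added extra safeguards so that punctuation does not cause errors.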
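
To see what the pattern `\w*\d+\w*` does on its own, here is a small standard-library sketch. The sample sentence is made up for illustration. Note that `re.split` can return empty strings at the edges when the text starts or ends with a separator, so those may need to be filtered out before any stop-word lookup.

```python
import re

NUMBER_WORD = r'\w*\d+\w*'  # a run of word characters containing at least one digit

sample = "co555in rocks, the 42b too."

# Words that contain a digit are matched whole, letters on either side included.
number_words = re.findall(NUMBER_WORD, sample)
print(number_words)  # ['co555in', '42b']

# Removing them leaves only the digit-free text (and some stray whitespace).
rest = re.sub(NUMBER_WORD, '', sample)
print(repr(rest))  # ' rocks, the  too.'

# Splitting on non-word characters yields empty strings at both ends here.
pieces = re.split(r'\W+', rest)
print(pieces)  # ['', 'rocks', 'the', 'too', '']
```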

Below is my remove_stopwords method, along with some additional code I used for testing.

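    import spacy
    import pandas as pd
    import re

    nlp = spacy.load('en_core_web_sm')

    def remove_stopwords(text):
        """
        Removes stop words from a text source
        """
        # Words containing at least one digit are set aside untouched
        number_words = re.findall(r'\w*\d+\w*', text)
        # Strip those words out, then split the remainder on non-word characters
        remove_numbers = re.sub(r'\w*\d+\w*', '', text)
        split_text = re.split(r'\W+', remove_numbers)
        # Skip the empty strings re.split yields around leading/trailing separators
        remove_stop_words = [word for word in split_text
                             if word and not nlp.vocab[word].is_stop]
        # Note: digit-bearing words are placed first, so word order is not preserved
        final_words = number_words + remove_stop_words
        return " ".join(final_words)

    df = pd.read_csv('input_file.csv', sep='\t') # replace with your file; sep='\t' assumes tab-separated input
    df['text'] = df['text'].apply(remove_stopwords)
    df.to_csv('output_file.csv', index=False) # replace with your desired output file name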
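
One thing to be aware of: the function above returns the digit-bearing words first and the remaining words after them, so the original word order is lost. If order matters, a single pass over the words is one possible alternative. The sketch below is an assumption-laden variant, not part of the original answer: `remove_stopwords_keep_order` and `STOP_WORDS` are hypothetical names, and the small stop list stands in for spaCy so the example runs without a model.

```python
import re

# Hypothetical stand-in for spaCy's stop list, so the sketch runs without a model.
STOP_WORDS = {"the", "is", "in", "a", "of", "and"}

WORD_RE = re.compile(r"\w+")   # runs of word characters; punctuation is skipped
DIGIT_RE = re.compile(r"\d")   # used to detect words that contain any digit

def default_is_stop(word):
    """Case-insensitive lookup in the stand-in stop list."""
    return word.lower() in STOP_WORDS

def remove_stopwords_keep_order(text, is_stop=default_is_stop):
    """Drop stop words but keep any word containing a digit, preserving order."""
    kept = []
    for word in WORD_RE.findall(text):
        # A word with a digit is never split or filtered; others go through is_stop.
        if DIGIT_RE.search(word) or not is_stop(word):
            kept.append(word)
    return " ".join(kept)

print(remove_stopwords_keep_order("the co555in is in room 42b, ok"))
# co555in room 42b ok
```

To use spaCy's own stop list instead of the stand-in, the same lookup as in the answer can be passed in, for example `is_stop=lambda w: nlp.vocab[w].is_stop`.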
