Python: extracting dates from a sentence with spaCy

Posted on April 7

I have a string like this:

"The dates are from 30 June 2019 to 1 January 2022 inclusive"

I want to extract the dates from this string using spaCy.

Here is my function so far:

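import spacy

# Assumption: the small English pipeline; any pipeline with an NER component works.
nlp = spacy.load("en_core_web_sm")

def extract_dates_with_year(text):
    doc = nlp(text)
    dates_with_year = []
    for ent in doc.ents:
        if ent.label_ == "DATE":
            dates_with_year.append(ent.text)
    return dates_with_year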

This returns the following output:

['30 June 2019 to 1 January 2022']

However, I would like the output to be:

['30 June 2019', '1 January 2022']

Recommended answer

The problem is that "to" is treated as part of the date. So when you run for ent in doc.ents, the loop iterates only once, because "30 June 2019 to 1 January 2022" is recognized as a single entity.

Since you don't want this behavior, you can modify your function to split each entity on "to":

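import re

def extract_dates_with_year(text):
    doc = nlp(text)
    dates_with_year = []
    for ent in doc.ents:
        if ent.label_ == "DATE":
            # Split on "to" only as a standalone word, so month names
            # such as "October" are not cut in half.
            for ent_txt in re.split(r"\s+to\s+", ent.text):
                dates_with_year.append(ent_txt.strip())
    return dates_with_year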
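One caveat worth illustrating: splitting on the bare substring "to" also matches inside words such as "October". The sketch below uses only the standard library and a hypothetical helper, split_range (the name and the choice of separators are assumptions, not part of spaCy), to compare a plain str.split with a split on whitespace-delimited separator words.

```python
import re

# Hypothetical helper, not part of spaCy: split a date-range string on
# standalone separator words only. The separator list is an assumption.
RANGE_SEPARATOR = re.compile(r"\s+(?:to|until)\s+")

def split_range(text):
    """Return the non-empty pieces of text around standalone 'to'/'until'."""
    return [part.strip() for part in RANGE_SEPARATOR.split(text) if part.strip()]

sample = "1 October 2019 to 5 October 2019"

# Plain substring split also cuts "October" at its inner "to".
naive = [part.strip() for part in sample.split("to")]
print(naive)                # ['1 Oc', 'ber 2019', '5 Oc', 'ber 2019']

# Whole-word split leaves the month names intact.
print(split_range(sample))  # ['1 October 2019', '5 October 2019']
```

In the function above, split_range(ent.text) could stand in for the inner split if entities containing "October" or other separator words are expected.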

This correctly handles the dates below, including single dates and strings that contain several dates:

txt = """
     The dates are from 30 June 2019 to 1 January 2022 inclusive.
     And oddly also 5 January 2024.
     And exclude 21 July 2019 until 23 July 2019.
"""

extract_dates_with_year(txt)

# Output:
[
 '30 June 2019',
 '1 January 2022',
 '5 January 2024',
 '21 July 2019',
 '23 July 2019'
]