I'm using a regular expression to extract the month and year from date pairs in text:
import re

regex = (
r"((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?(t)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)"
r"\s?[\.\s\’\’\,\/\'\,\‘\-\–\—]?\s?(\d{4}|\d{2})?\s?\s?((to)|[\|\-\–\—])\s?\s?"
r"((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?(t)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)"
r"\s?[\.\s\’\’\,\/\'\,\‘\-\–\—]?\s?(\d{4}|\d{2})|(Present|Now|till\s?(now|date|today)?|current)))"
)
When I test the regex, some of the inputs include the day of the month in a date and others don't:
lst = [
    'July 2014 - 28th August 2014',
    'Jan 2012 - 3rd sep 2014',
    'Jan 2008 - May 2012',
    'Jan 2008 and May 2012'
]
for i in lst:
    word = re.finditer(regex, i, re.IGNORECASE)
    for match in word:
        print(match.group())
I get the following output:
Jan 2008 - May 2012
But my expected output is:
July 2014 - August 2014
Jan 2012 - sep 2014
Jan 2008 - May 2012
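What do I need to change so that the regex matches text where a date has an optional day? When a date string includes the day, it is always an ordinal with an st, nd, rd, or th suffix.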
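
One possible direction, sketched under assumptions rather than as a drop-in fix: allow an optional ordinal day before each month, and capture the month and year in named groups so the output can be rebuilt without the day (match.group() on its own would still include the day). The pattern below is a simplified stand-in for the regex above. The names MONTH, DAY, YEAR, SEP, PATTERN, and extract_ranges are hypothetical, and the sketch leaves out the Present/Now/till alternatives and the optional punctuation between month and year.

```python
import re

# Simplified month alternation (assumption: Sep, Sept, and September are all valid).
MONTH = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?"
    r"|Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?"
    r"|Nov(?:ember)?|Dec(?:ember)?)"
)
# Optional ordinal day such as "28th " or "3rd ", matched but not captured.
DAY = r"(?:\d{1,2}(?:st|nd|rd|th)\s+)?"
YEAR = r"(?:\d{4}|\d{2})"
# Range separator: the word "to", a hyphen, an en or em dash, or a pipe.
SEP = r"\s*(?:to|[-–—|])\s*"

PATTERN = re.compile(
    DAY + r"(?P<m1>" + MONTH + r")\s+(?P<y1>" + YEAR + r")"
    + SEP
    + DAY + r"(?P<m2>" + MONTH + r")\s+(?P<y2>" + YEAR + r")",
    re.IGNORECASE,
)


def extract_ranges(text):
    """Return 'Month Year - Month Year' strings, rebuilt from the named groups."""
    return [
        "{} {} - {} {}".format(
            m.group("m1"), m.group("y1"), m.group("m2"), m.group("y2")
        )
        for m in PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    for line in [
        "July 2014 - 28th August 2014",
        "Jan 2012 - 3rd sep 2014",
        "Jan 2008 - May 2012",
        "Jan 2008 and May 2012",
    ]:
        for result in extract_ranges(line):
            print(result)
```

Rebuilding the string from groups is a deliberate choice here: a single match is a contiguous span of the input, so it cannot skip over "28th" by itself.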