I have some PascalCase text that I am trying to split into separate tokens/words.
For example, "Hello123AIIsCool" would become ["Hello", "123", "AI", "Is", "Cool"].
Some Conditions
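- Words always start with an uppercase letter, e.g. "Hello".
- A contiguous run of digits should be kept together, e.g. "123" -> ["123"], not ["1", "2", "3"].
- A contiguous run of uppercase letters should be kept together, except when the last letter is the start of a new word as defined in the first condition, e.g. "ABCat" -> ["AB", "Cat"], not ["ABC", "at"].
- There is no guarantee that every condition has a match in a given string. For example, "Hello", "HelloAI", "HelloAIIsCool", "Hello123", "123AI", "AIIsCool", and any other combination I have not listed are all potential inputs.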
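
To make the conditions above concrete, here is a minimal sketch that writes them down as expected input/output pairs. The names `EXPECTED` and `failing_cases` are made up for illustration, and the expected splits are my reading of the rules applied to the example strings listed above, so treat them as assumptions rather than a complete specification.

```python
from typing import Callable

# Expected splits for the example inputs, derived from the conditions above.
EXPECTED: dict[str, list[str]] = {
    "Hello": ["Hello"],
    "HelloAI": ["Hello", "AI"],
    "HelloAIIsCool": ["Hello", "AI", "Is", "Cool"],
    "Hello123": ["Hello", "123"],
    "123AI": ["123", "AI"],
    "AIIsCool": ["AI", "Is", "Cool"],
    "ABCat": ["AB", "Cat"],
    "Hello123AIIsCool": ["Hello", "123", "AI", "Is", "Cool"],
}


def failing_cases(extract: Callable[[str], list[str]]) -> dict[str, list[str]]:
    """Return {input: actual output} for every case the extractor gets wrong."""
    failures: dict[str, list[str]] = {}
    for text, want in EXPECTED.items():
        actual = extract(text)
        if actual != want:
            failures[text] = actual
    return failures
```

Any candidate extractor can be passed to `failing_cases`; an empty dict means it agrees with every case listed here, which says nothing about inputs outside this list.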
I have tried a few regex variations. The following two attempts got me close to what I want, but not quite there.
Version 0
import re

def extract_v0(string: str) -> list[str]:
    word_pattern = r"[A-Z][a-z]*"
    num_pattern = r"\d+"
    pattern = f"{word_pattern}|{num_pattern}"
    extracts: list[str] = re.findall(
        pattern=pattern, string=string
    )
    return extracts

string = "Hello123AIIsCool"
extract_v0(string)

['Hello', '123', 'A', 'I', 'Is', 'Cool']
Version 1
import re

def extract_v1(string: str) -> list[str]:
    word_pattern = r"[A-Z][a-z]+"
    num_pattern = r"\d+"
    upper_pattern = r"[A-Z][^a-z]*"
    pattern = f"{word_pattern}|{num_pattern}|{upper_pattern}"
    extracts: list[str] = re.findall(
        pattern=pattern, string=string
    )
    return extracts

string = "Hello123AIIsCool"
extract_v1(string)

['Hello', '123', 'AII', 'Cool']
Best Option So Far
This uses a combination of regex and looping. It works, but is it the best solution? Or is there some fancy regex that can do this on its own?
import re

def extract_v2(string: str) -> list[str]:
    word_pattern = r"[A-Z][a-z]+"
    num_pattern = r"\d+"
    upper_pattern = r"[A-Z][A-Z]*"
    groups = []
    # Apply the patterns in priority order: words, then digits, then uppercase runs.
    for pattern in [word_pattern, num_pattern, upper_pattern]:
        while string.strip():
            group = re.search(pattern=pattern, string=string)
            if group is not None:
                groups.append(group)
                # Blank out the match so later patterns cannot reuse it.
                string = string[:group.start()] + " " + string[group.end():]
            else:
                break
    # Restore the original left-to-right order of the matches.
    ordered = sorted(groups, key=lambda g: g.start())
    return [grp.group() for grp in ordered]

string = "Hello123AIIsCool"
extract_v2(string)

['Hello', '123', 'AI', 'Is', 'Cool']
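
One candidate for the single "fancy regex" asked about above, offered as a sketch rather than a verified general answer: keep the word and digit alternatives, and let the uppercase alternative use a negative lookahead so that a run of capitals stops before a letter that begins a new word. The names `TOKEN_PATTERN` and `extract_v3` are hypothetical, and the pattern assumes ASCII letters only.

```python
import re

# Alternatives are tried left to right at each position:
#   [A-Z][a-z]+       a capitalized word such as "Hello" or "Is"
#   \d+               a run of digits such as "123"
#   [A-Z]+(?![a-z])   a run of capitals not followed by a lowercase letter;
#                     backtracking gives up the last capital when it starts a word
TOKEN_PATTERN = re.compile(r"[A-Z][a-z]+|\d+|[A-Z]+(?![a-z])")


def extract_v3(string: str) -> list[str]:
    # The pattern has no capturing groups, so findall returns the full matches.
    return TOKEN_PATTERN.findall(string)


print(extract_v3("Hello123AIIsCool"))  # ['Hello', '123', 'AI', 'Is', 'Cool']
print(extract_v3("ABCat"))             # ['AB', 'Cat']
```

The lookahead is what separates this from Version 1: `[A-Z][^a-z]*` swallows the `I` of `Is`, whereas `[A-Z]+(?![a-z])` backs off by one letter whenever the run would end directly before a lowercase letter. Characters that match none of the alternatives, such as a leading lowercase letter, are silently skipped.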