I have text containing various transactions that I am trying to parse with a regex.
The text looks like this:
JT Meta Platforms, Inc. - Class A
Common Stock (META) [ST]S (partial) 02/08/2024 03/05/2024 $1,001 - $15,000
F S: New
S O: Morgan Stanley - Select UMA Account # 1
JT Microsoft Corporation - Common
Stock (MSFT) [ST]S (partial) 02/08/2024 03/05/2024 $1,001 - $15,000
F S: New
S O: Morgan Stanley - Select UMA Account # 1
JT Microsoft Corporation - Common
Stock (MSFT) [OP]P 02/13/2024 03/05/2024 $500,001 -
$1,000,000
F S: New
S O: Morgan Stanley - Portfolio Management Active Assets Account
D: Call options; Strike price $170; Expires 01/17 /2025
C: Ref: 044Q34N6
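I created a regex to parse individual transactions, each identified by the combination of a ticker (e.g., (MSFT)), a type (e.g., [ST], [OP]), and an amount (e.g., $500,000, etc.), as follows: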
transactions = rx.findall(r"\([A-Z][^$]*\$[^$]*\$[,\d]+", text)
The transactions are returned as a list, for example:
(META) [ST]S (partial) 02/08/2024 03/05/2024 $1,001 - $15,000
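I would like to add logic to include the description details (i.e., the 'D: ...' line) when they exist. I tried the pattern below, but it ends up returning just one large transaction, because the first two transactions have no description details (i.e., no "D:").
I would like to see this: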
(META) [ST]S (partial) 02/08/2024 03/05/2024 $1,001 - $15,000
..
(MSFT) [ST]S (partial) 02/08/2024 03/05/2024 $1,001 - $15,000
..
(MSFT) [OP]P 02/13/2024 03/05/2024 $500,001 -
$1,000,000
F S: New
S O: Morgan Stanley - Portfolio Management Active Assets Account
D: Call options; Strike price $170; Expires 01/17 /2025
What am I doing wrong?
rx.findall(r"\([A-Z][^$]*\$[^$]*\$[,\d]+[\s\S]*?D:(.*)", text)
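A minimal sketch of why this attempt collapses into a single result, assuming `rx` behaves like the standard `re` module (the post does not say which module it is). The sample is an abbreviated copy of the text above: one transaction without a "D:" line followed by one with it. Because "D:" is mandatory in the pattern, the lazy `[\s\S]*?` keeps expanding past the first transaction until it reaches the first "D:" anywhere in the text.

```python
import re

# Abbreviated sample from the post: a transaction with no "D:" line,
# followed by one that has it.
text = """JT Meta Platforms, Inc. - Class A
Common Stock (META) [ST]S (partial) 02/08/2024 03/05/2024 $1,001 - $15,000
F S: New
S O: Morgan Stanley - Select UMA Account # 1
JT Microsoft Corporation - Common
Stock (MSFT) [OP]P 02/13/2024 03/05/2024 $500,001 -
$1,000,000
F S: New
S O: Morgan Stanley - Portfolio Management Active Assets Account
D: Call options; Strike price $170; Expires 01/17 /2025
C: Ref: 044Q34N6
"""

pattern = r"\([A-Z][^$]*\$[^$]*\$[,\d]+[\s\S]*?D:(.*)"

# The overall match starts at (META) and runs through the MSFT
# transaction's description line, swallowing both transactions.
first = re.search(pattern, text)
whole_match = first.group(0)

# With one capturing group, findall returns only the group contents,
# so the list holds a single description string.
found = re.findall(pattern, text)

print(len(found))
print(whole_match)
```

Making the description part optional, rather than required, is what lets each transaction end on its own amount when no "D:" line follows.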
Edit:
To handle cases where the colon is not adjacent to the "D" (imperfect PDF parsing), I added to @zdim's answer; this solved the problem above:
rx.findall(r'\([A-Z][^$]*\$[^$]*\$[,\d]+(?:[^$]*D:?.+)?', text)
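For reference, a self-contained sketch that runs this final pattern against the sample text from the top of the post. It assumes `rx` can be swapped for the standard `re` module, which is an assumption; the variable names are only illustrative. The description part is wrapped in an optional non-capturing group, so `findall` returns whole matches and a transaction without a description simply ends at its amount.

```python
import re

# Sample text copied from the post.
text = """JT Meta Platforms, Inc. - Class A
Common Stock (META) [ST]S (partial) 02/08/2024 03/05/2024 $1,001 - $15,000
F S: New
S O: Morgan Stanley - Select UMA Account # 1
JT Microsoft Corporation - Common
Stock (MSFT) [ST]S (partial) 02/08/2024 03/05/2024 $1,001 - $15,000
F S: New
S O: Morgan Stanley - Select UMA Account # 1
JT Microsoft Corporation - Common
Stock (MSFT) [OP]P 02/13/2024 03/05/2024 $500,001 -
$1,000,000
F S: New
S O: Morgan Stanley - Portfolio Management Active Assets Account
D: Call options; Strike price $170; Expires 01/17 /2025
C: Ref: 044Q34N6
"""

# (?: ... )? makes the description optional. [^$]* cannot cross a "$",
# so the search for "D" stops before the next transaction's amount.
pattern = r"\([A-Z][^$]*\$[^$]*\$[,\d]+(?:[^$]*D:?.+)?"

transactions = re.findall(pattern, text)

for t in transactions:
    print(t)
    print("..")
```

One caveat: `D:?` makes the colon optional, so any capital "D" that appears before the next `$` would also be treated as the start of a description. That does not happen in this sample, but it may be worth checking against the full data.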