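Good evening,
I am using Python to convert a PDF to CSV, and I use a regex to extract the information.
After extracting the text from the PDF, the raw text looks something like this: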
Account Transaction Details
Twin Account 123-456-789-1
Date Description Withdrawals Deposits Balance
01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78
mBK-4653112690
03 Jan Inward Credit-QUICK 3,000.84 123,456.78
WIRE OTHR
ANTON HARLEY
Other
03 Jan Funds Trf - SPEED 3,500.00 123,345.78
PIB8452145632845963
Abricot 480
OTHR Transfer
I used the regex [0-3]{1}[0-9]{1}\s[A-Z]{1}[a-z]{2}\s[?A-Za-z]{1,155}
and managed to get the transaction lines I need:
01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78
03 Jan Inward Credit-QUICK 3,000.84 123,456.78
03 Jan Funds Trf - SPEED 3,500.00 123,345.78
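However, the additional information between the matches was dropped, because I split the text on \n
and then ran the regex on each line separately.
How can I code this so that the additional information between regex matches is captured and appended to the preceding match? This is the output I have in mind: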
01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78 mBK-4653112690
03 Jan Inward Credit-QUICK 3,000.84 123,456.78 OTHR ANTON HARLEY Other
03 Jan Funds Trf - SPEED 3,500.00 123,345.78 PIB8452145632845963 Abricot 480 OTHR Transfer
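One possible way to get there, as a rough sketch rather than a definitive solution: keep splitting on \n, but treat any line that does not start with a date as a continuation of the last transaction line seen. The names TXN_START and merge_continuations are made up for this illustration, and the pattern only checks the "DD Mon " prefix, which is an assumption about how every transaction line begins.

```python
import re

# Assumption: every transaction line starts with a two-digit day, a space,
# a capitalised three-letter month and another space, e.g. "03 Jan ".
TXN_START = re.compile(r"^[0-3][0-9]\s[A-Z][a-z]{2}\s")


def merge_continuations(text):
    """Append each non-matching line to the transaction line before it."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if TXN_START.match(line):
            rows.append(line)  # a new transaction starts here
        elif rows:
            rows[-1] += " " + line  # extra detail for the previous transaction
        # Lines seen before the first transaction (headers) are skipped.
    return rows


sample = """Account Transaction Details
Twin Account 123-456-789-1
Date Description Withdrawals Deposits Balance
01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78
mBK-4653112690
03 Jan Inward Credit-QUICK 3,000.84 123,456.78
WIRE OTHR
ANTON HARLEY
Other
03 Jan Funds Trf - SPEED 3,500.00 123,345.78
PIB8452145632845963
Abricot 480
OTHR Transfer"""

for row in merge_continuations(sample):
    print(row)
```

Every continuation line is kept verbatim in this sketch, so the third row comes out with "WIRE OTHR ANTON HARLEY Other" appended.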
Edit:
I have adopted @dcsuka's solution and got the following result:
06 Jan Debit-Consumer 12.60 123,456.78 SNIP AVENU13568100 4265884035605848
06 Jan Inward DR - 828.24 123,456.78 SHIP G12345HUJ ITX
07 Jan Funds Transfer 50.00 123,456.78 Pleasenotethatyouareboundbyadutyundertherulesgoverningtheoperationofthisaccount,tochecktheentriesintheabovestatement. Ifyoudonotnotifyusinwritingofanyerrors, omissionsorunauthoriseddebitswithinfourteen(14)daysofthisstatement,theentriesaboveshallbedeemedvalid,correct,accurateandconclusivelybindinguponyou,andyoushallhaveno claim against the bank in relation thereto. XYZ Ltd • 80 QuincyPlace ABC Plaza XXX 12345 • Co. Reg. No. 1234567890Z • GST Reg. No. YY-8121234-2 • www.xyzabc.com
07 Jan Inward CR - SPEED 9,092.06 123,456.78 SALAD SALAS Payment CARL QWE 817264950
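How can I remove the extra words "Pleasenotethatyouareboundbyadut..."?
The only pattern I can see is that it will be a very long word (probably more than 20 characters). Is this the way to go?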