I found the bbox coordinates in the lxml file and used PDFQuery to extract the data I want. I then write the data to a csv file.
import pandas as pd
import pdfquery
from pathlib import Path

def pdf_scrape(pdf):
    """
    Extract each relevant piece of information individually.
    input: pdf to be scraped
    returns: dataframe of scraped data
    """
    # Bounding-box coordinates of the text to be extracted
    CUSTOMER = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 563.285, 624.656, 580.888")').text()
    CUSTOMER_REF = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 534.939, 443.186, 552.542")').text()
    SALES_ORDER = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 504.692, 414.352, 522.295")').text()
    ITEM_NUMBER = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 478.246, 395.129, 495.849")').text()
    KEY = '0000' + SALES_ORDER + '-' + '00' + ITEM_NUMBER
    # Combine all relevant information into a single pandas DataFrame
    page = pd.DataFrame({
        'KEY': KEY,
        'CUSTOMER': CUSTOMER,
        'CUSTOMER REF.': CUSTOMER_REF,
        'SALES ORDER': SALES_ORDER,
        'ITEM NUMBER': ITEM_NUMBER
    }, index=[0])
    return page
pdf_search = Path("files/").glob("*.pdf")
pdf_files = [str(file.absolute()) for file in pdf_search]

master = list()
for pdf_file in pdf_files:
    pdf = pdfquery.PDFQuery(pdf_file)
    pdf.load(0)  # only the first page is needed
    # Scrape the first page of each document and collect the results
    page = pdf_scrape(pdf)
    master.append(page)

master = pd.concat(master, ignore_index=True)
master.to_csv('scraped_PDF_as_csv/scraped_PDF_DataFrame.csv', index=False)
The problem is that I need to read hundreds of PDFs every day, and this script takes roughly 13-14 seconds just to pull four elements from the first page of 10 PDFs.
Is there any way to speed up my code?
I have tried using PyMuPDF, since it is supposed to be faster, but I ran into problems implementing it and could not get it to produce the same output as PDFQuery. Does anyone know how to do this?
To reiterate: I know where the text I need is located in the document, but I do not necessarily know what it says.