所以,我试图解析一个PDF文件(30000x2000点),使用Python,其中有各种数据,表格,线条,文本,注释,图像等.The goal:在PDF上找到某个文本字符串,并返回一个接近文本的注释.我用PyPDF2找到所有的笔记和它们的坐标.为了找到文本字符串和它们的坐标,我使用fitz.
Using fitz, I searched for 'A715_X1'.个
import fitz
doc = fitz.open(path_pdf)
for page in doc:
coordinates_of_item_found_on_print = page.search_for('A715_X1')
Result: X coordinate: 9076 points个
Then, using PyPDF2, I searched for the 'Some text here' note.
from PyPDF2 import PdfReader
reader = PdfReader(sample.pdf)
for page in reader.pages:
if "/Annots" in page:
for annot in page["/Annots"]:
obj = annot.get_object()
markup_coordinates = obj['/Rect']
if obj['/Subtype'] == '/FreeText': # skip Stamp, Popup, Square, PolyLine
if obj['/Contents'] == 'Some text here':
try:
markup_loc.append(str(round(markup_coordinates[0])))
print('Note: ' + obj['/Contents'] + ' X-coord: ' + str(round(markup_coordinates[0])))
except Exception as e:
print(e)
Result: Note: Some text here X-coord: 4280个
我附上了一个来自PDFXchange的屏幕截图,显示‘A715_X1’就在‘一些文本’注释的正下方,所以X坐标应该大致相同+-几个点.从输出上看,根据尺子的线条,9076似乎是一个"真正的"价值.那为什么我会得到两个不同的坐标?
示例pdf: