Getting started¶
pyxpdf
is a wrapper on xpdf reader
sources.
It aims to provide a fast and memory efficient pdf parser with easy to use API.
Installation¶
pip install pyxpdf
For additional encodings support, install optional dependency pyxpdf_data
pip install pyxpdf_data
For Image extraction and pdf to image support, install optional dependency Pillow
pip install Pillow
Quick Start¶
pyxpdf
use Document
to represent and load a PDF file.
Similary Page
for PDF Page.
All the xpdf related settings can be accessed with global Config
object.
from pyxpdf import Document, Page, Config
from pyxpdf.xpdf import TextControl
doc = Document("samples/nonfree/mandarin.pdf")
# or
# load pdf from file like object
with open("samples/nonfree/mandarin.pdf", 'rb') as fp:
doc = Document(fp)
# get pdf metadata dict
print(doc.info())
# >>> doc.info()
# {'CreationDate': "D:20080721141207-04'00'",
# 'Subject': 'Chinese Version of Universal PCXR8 ...',
# 'Author': 'SKC Inc.',
# 'Creator': 'PScript5.dll
# .....
# get all text
all_text = doc.text()
# iter first 10 pages
for page in doc[:10]:
# get page label if any
print(page.label)
# get page by page label
label_page = doc['1']
# get text in table layout without discarding clipped
# text.
text_control = TextControl("table", discard_clipped=True)
text = label_page.text(control=text_control)
# find case sensitive text within [x_min, y_min, x_max, y_max]
res_box = label_page.find_text('操作说明', search_box=[0, 0, 400, 400],
case_sensitive=True)
# >>> print(res_box)
# (281.88, 269.718, 354.05819999999994, 287.7)
# load xpdfrc
Config.load_file('my_xpdfrc')
# suppress stderr output for xpdf error log.
Config.error_quiet = False
Checkout API Reference for more details.
Todo
Add bechmark and speed comparison with python pdf modules