Getting started

pyxpdf is a Python wrapper around the xpdf reader sources.

It aims to provide a fast, memory-efficient PDF parser with an easy-to-use API.

Installation

pip install pyxpdf

For additional encoding support, install the optional dependency pyxpdf_data:

pip install pyxpdf_data

For image extraction and PDF-to-image support, install the optional dependency Pillow:

pip install Pillow
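
To confirm which optional dependencies are available in the current environment, a small check like the one below may help. It is a sketch, not part of pyxpdf: the helper name missing_optional is hypothetical, and the import names are assumptions (Pillow is imported as PIL; pyxpdf_data is assumed to be importable under its package name).

```python
from importlib.util import find_spec

# Maps pip package name -> import name.
# Assumption: pyxpdf_data is importable as "pyxpdf_data"; Pillow is imported as "PIL".
OPTIONAL_MODULES = {"pyxpdf_data": "pyxpdf_data", "Pillow": "PIL"}


def missing_optional(modules=None):
    """Return the pip names of optional packages that cannot be imported."""
    if modules is None:
        modules = OPTIONAL_MODULES
    return [pip_name for pip_name, module in modules.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_optional()
    if missing:
        print("Optional packages not installed:", ", ".join(missing))
    else:
        print("All optional packages are installed.")
```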

Quick Start

pyxpdf uses Document to load and represent a PDF file, and similarly Page to represent a PDF page.

All xpdf-related settings can be accessed through the global Config object.

from pyxpdf import Document, Page, Config
from pyxpdf.xpdf import TextControl

doc = Document("samples/nonfree/mandarin.pdf")
# or
# load pdf from file like object
with open("samples/nonfree/mandarin.pdf", 'rb') as fp:
    doc = Document(fp)

# get pdf metadata dict
print(doc.info())
# >>> doc.info()
# {'CreationDate': "D:20080721141207-04'00'",
#  'Subject': 'Chinese Version of Universal PCXR8 ...',
#  'Author': 'SKC Inc.',
#  'Creator': 'PScript5.dll
#   .....

# get all text
all_text = doc.text()

# iterate over the first 10 pages
for page in doc[:10]:
    # get page label if any
    print(page.label)

# get page by page label
label_page = doc['1']

# get text in table layout, discarding clipped
# text.
text_control = TextControl("table", discard_clipped=True)
text = label_page.text(control=text_control)

# find case-sensitive text within [x_min, y_min, x_max, y_max]
res_box = label_page.find_text('操作说明', search_box=[0, 0, 400, 400],
                                case_sensitive=True)
# >>> print(res_box)
# (281.88, 269.718, 354.05819999999994, 287.7)

# load xpdfrc
Config.load_file('my_xpdfrc')
# suppress stderr output for the xpdf error log.
Config.error_quiet = True
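
The page iteration and text extraction shown above can be combined into a small helper that collects text per page. This is a minimal sketch: text_by_label is a hypothetical name, and it assumes only that each page exposes a label attribute and a text() method callable without arguments, as the quick start suggests. Pages without a label fall back to their 1-based position in the iteration.

```python
def text_by_label(pages):
    """Map each page's label (or its 1-based position if unlabeled) to its text."""
    result = {}
    for position, page in enumerate(pages, start=1):
        # Assumption: an unlabeled page has a falsy label (None or "").
        key = page.label or str(position)
        result[key] = page.text()
    return result


# Usage with pyxpdf (not executed here; the path is the one used above):
# from pyxpdf import Document
# doc = Document("samples/nonfree/mandarin.pdf")
# texts = text_by_label(doc[:10])
```

Note that pages sharing the same label would overwrite each other in the returned dict; keying by position instead avoids this if labels are not unique.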

Check out the API Reference for more details.

Todo

Add benchmark and speed comparison with other Python PDF modules