Document and Page

Document

class pyxpdf.xpdf.Document(pdf, ownerpass=None, userpass=None)

This class represents a PDF Document.

Page objects can be accessed through indexing and slicing.

If the pdf parameter is a file-like object, make sure it is opened in binary (‘b’) mode, because Document does not check for it.

Examples

>>> doc = Document("~/sample.pdf")

Total pages in Document

>>> len(doc)
17

Access page by index

>>> page1 = doc[0]
>>> page1
<Page[0]>

Access page by label

>>> cover_page = doc['Cover1']
>>> cover_page
<Page[0](label='Cover1')>

Get a slice of pages (every page with an even index)

>>> even_pages = doc[0::2]
>>> even_pages
[<Page[0]>, <Page[2]>, <Page[4]>, ...]

Iterate over pages

>>> for page in doc:
...     print(page)
<Page[0]>
<Page[1]>
<Page[2]>
...

Parameters
  • pdf (str or file-like) – Path of pdf file to load or a file-like object.

  • ownerpass (str, optional) – Owner password of pdf file, if encrypted (default None)

  • userpass (str, optional) – User password of pdf file, if encrypted (default None)

Raises
  • PDFPermissionError – If the encrypted PDF cannot be decrypted with the given passwords

  • PDFIOError – If the file cannot be loaded from the given file path
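
Because Document does not verify the mode of a file-like object, a caller may want to check it before passing the object in. The following is a minimal sketch using only the standard library; is_binary_stream is a hypothetical helper and not part of pyxpdf.

```python
import io


def is_binary_stream(fileobj):
    """Return True if fileobj is a binary (bytes) stream.

    Hypothetical helper, not part of pyxpdf. Files opened with 'rb' and
    io.BytesIO are binary streams; text-mode files and io.StringIO are not.
    """
    return isinstance(fileobj, (io.RawIOBase, io.BufferedIOBase))


print(is_binary_stream(io.BytesIO(b"%PDF-1.7")))  # True
print(is_binary_stream(io.StringIO("%PDF-1.7")))  # False
```

A check like this could run just before constructing a Document from an already opened file.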

filename

Name of the file from which the PDF document was loaded.

If the PDF was loaded from a file-like object, this is an empty str.

Type

str

has_page_labels

whether pdf has page labels or not

Type

bool

info(self)

Get the PDF’s info dictionary.

PDF info dictionary contains keys such as Author, Creator, ModDate, etc.

Returns

PDF’s information dictionary.

Return type

dict

is_encrypted

whether pdf is encrypted or not

Warning

Due to a bug in the xpdf sources, this is sometimes True even for non-encrypted PDF documents.

Type

bool

is_linearized

whether pdf is linearized or not

Type

bool

num_pages

total pages in pdf

Type

int

ok_to_add_notes

PDF add notes permission

Type

bool

ok_to_change

PDF change permission

Whether PDF can be modified or not. Modifications include:
  • Inserting, deleting, and rotating pages.

  • Commenting, filling in form fields, and signing existing signature fields.

Type

bool

ok_to_copy

PDF copy permission.

Whether pdf content can be copied or not.

Note

PDF copy permission is required to extract text and images from the document.

Type

bool
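
Since copy permission is required for text extraction, a caller may want to test ok_to_copy before calling text(). The sketch below assumes only the documented ok_to_copy attribute and text() method; extract_text_if_allowed is a hypothetical helper, and what pyxpdf itself does when permission is missing is not covered here.

```python
def extract_text_if_allowed(doc):
    """Return doc.text() when copying is permitted, else None.

    Hypothetical helper, not part of pyxpdf. `doc` is assumed to be a
    Document-like object exposing the documented `ok_to_copy` attribute
    and `text()` method.
    """
    if not doc.ok_to_copy:
        return None
    return doc.text()
```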

ok_to_print

PDF print permission.

Whether document can be printed or not.

Type

bool

pdf_version

version of the PDF standard that the document complies with

Type

float

text(self, start=0, end=-1, control=None)

Parse and extract UTF-8 decoded text from given page range.

Extracted text can be adjusted using control parameter.

Parameters
  • start (int) – index of first page to extract

  • end (int) – index of last page to extract

  • control (TextControl, optional) – A TextControl object, used to control the format of the extracted text. (default is None, which implies text will be extracted using default values from the TextControl class)

Returns

a ‘UTF-8’ decoded str object containing all the extracted text.

Return type

str

Note

This method is almost identical to text_bytes(); the only difference is that it decodes the extracted bytes as UTF-8 with the ‘ignore’ (codecs.ignore_errors()) error handler.

See also

Page.text()

TextOutput()

PDF to Text output device with caching support.
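
To illustrate what the ‘ignore’ error handler mentioned in the note above does, here is a small standard-library sketch. The byte string is made-up sample data standing in for extraction output, not something produced by pyxpdf.

```python
# Simulated extraction output: valid UTF-8 for "café", then a stray 0xFF byte.
raw = b"caf\xc3\xa9 \xff menu"

# errors="ignore" silently drops bytes that are not valid UTF-8.
decoded = raw.decode("utf-8", errors="ignore")
print(repr(decoded))  # 'café  menu'
```

The invalid byte disappears without any error, which is why text_bytes() is the better choice when lossless decoding matters.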

text_bytes(self, int start=0, int end=-1, TextControl control=None)

Parse and extract text from given page range.

Extracted text can be adjusted using the control parameter. This method should be used when the text encoding (Config.text_encoding) is different from UTF-8 or when you want to control the decoding of bytes yourself.

Parameters
  • start (int) – index of first page to extract

  • end (int) – index of last page to extract

  • control (TextControl) – A TextControl object, used to control the format of the extracted text. (default is None, which implies text will be extracted using default values from the TextControl class)

Returns

a Config.text_encoding encoded bytes object containing all the extracted text.

Return type

bytes

See also

Page.text_bytes()

TextOutput()

PDF to Text output device with caching support.
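
When decoding the returned bytes yourself, one possible approach is to try strict decoding with a list of candidate encodings. This is a sketch only: decode_with_fallback is a hypothetical helper, the sample bytes are made up, and in practice the first candidate would normally be whatever encoding is configured in Config.text_encoding.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Decode raw bytes with the first encoding that succeeds strictly.

    Hypothetical helper, not part of pyxpdf. Returns (text, encoding_used).
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Latin-1 bytes for "café"; they are not valid UTF-8, so the fallback is used.
text, used = decode_with_fallback(b"caf\xe9")
print(text, used)
```

Unlike the ‘ignore’ handler, strict decoding never drops bytes silently; a failure moves on to the next candidate instead.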

xmp_metadata(self)

Get the PDF’s XMP metadata.

Returns

The PDF’s XMP metadata.

Return type

str

Page

class pyxpdf.xpdf.Page(doc, index)

Represents a PDF page

Examples

>>> page1 = doc[1]

Page index and label (if any)

>>> page1.index
1
>>> page1.label
'Cover1'

Page BBox(s)

>>> page1.mediabox
(0.0, 0.0, 612.0, 792.0)
>>> page1.cropbox
(0.0, 0.0, 612.0, 792.0)

Find text location in Page

>>> page1.find_text("Hello")
(100.0, 74.768, 117.328, 96.968)

Parameters
  • doc (Document) – Parent pdf Document

  • index (int) – index of pdf Page

Raises

IndexError – If index parameter is outside page range

doc

Parent pdf document

Type

Document, readonly

index

Index of the page in the document

Type

int, readonly

label

Label of the page, if any

Type

str, readonly

artbox

Page’s art box coordinates

Type

tuple of float, (x1, y1, x2, y2)

bleedbox

Page’s bleed box coordinates

Type

tuple of float, (x1, y1, x2, y2)

crop_height

page cropbox height

Type

float

crop_width

page cropbox width

Type

float

cropbox

Page’s crop box coordinates

Type

tuple of float, (x1, y1, x2, y2)
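
All page boxes share the (x1, y1, x2, y2) layout, so a width and height can be derived from any of them. The sketch below is plain arithmetic on such a tuple; box_size is a hypothetical helper, and treating its result as equivalent to crop_width and crop_height (or media_width and media_height) is an assumption, not something this reference states.

```python
def box_size(box):
    """Return (width, height) of a box given as (x1, y1, x2, y2).

    Hypothetical helper, not part of pyxpdf.
    """
    x1, y1, x2, y2 = box
    return abs(x2 - x1), abs(y2 - y1)


# US Letter sized box, as in the Page examples.
print(box_size((0.0, 0.0, 612.0, 792.0)))  # (612.0, 792.0)
```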

find_all_text(self, text, search_box=None, case_sensitive=False, wholeword=False, rotation=0)

Find the text and get all the matches

Same as find_text(), but return all the matches.

Parameters
  • text (str) – Text to search in page

  • search_box (tuple of float, optional) – tuple of coordinates of the BBox that sets the search area. (default is None, which means the whole page area)

  • case_sensitive (bool, optional) – If False, match the text regardless of its case in page (default is False)

  • wholeword (bool, optional) – match the text as a whole word only. (default is False)

  • rotation (int, optional) – rotation of page (default is 0)

find_text(self, text, search_box=None, direction='top', case_sensitive=False, wholeword=False, rotation=0)

Find the text in Page.

Search for the text in the given search_box (BBox) of the page. If wholeword is True, try to match the text as a whole word.

If direction is ‘top’, start the search from the top of the page.

If direction is ‘next’, get the next match from the page.

If direction is ‘previous’, get the previous match from the page.

Parameters
  • text (str) – Text to search in page

  • search_box (tuple of float, optional) – tuple of coordinates of the BBox that sets the search area. (default is None, which means the whole page area)

  • direction ({'top', 'next', 'previous'}) – style of search (default is 'top')

  • case_sensitive (bool, optional) – If False, match the text regardless of its case in page (default is False)

  • wholeword (bool, optional) – match the text as a whole word only. (default is False)

  • rotation (int, optional) – rotation of page (default is 0)

Returns

If a match is found, a tuple of coordinates (x1, y1, x2, y2) of the BBox of the text in the page; otherwise None

Return type

tuple of float or None

See also

find_all_text()
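
The direction values can be combined to walk through matches one at a time: start with ‘top’, then keep asking for ‘next’ until None comes back. The sketch below assumes find_text() behaves exactly as documented above; iter_matches and its limit parameter are hypothetical and not part of pyxpdf. When all matches are needed at once, find_all_text() is the more direct choice.

```python
def iter_matches(page, text, limit=1000, **kwargs):
    """Yield bounding boxes of successive matches of text on page.

    Hypothetical helper, not part of pyxpdf. Assumes `page.find_text`
    behaves as documented: direction='top' starts from the top of the
    page, direction='next' continues from the previous match, and None
    means there is no (further) match. `limit` caps the number of
    results as a safety guard.
    """
    direction = "top"
    for _ in range(limit):
        bbox = page.find_text(text, direction=direction, **kwargs)
        if bbox is None:
            return
        yield bbox
        direction = "next"
```

Extra keyword arguments such as case_sensitive or wholeword are passed through to find_text() unchanged.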

is_cropped

whether page is cropped or not

Type

bool

media_height

page mediabox height

Type

float

media_width

page mediabox width

Type

float

mediabox

Page’s media box coordinates

Type

tuple of float, (x1, y1, x2, y2)

rotation

page rotation in degrees

Type

int

text(self, page_area=None, control=None)

Parse and extract UTF-8 decoded text from current page.

Extracted text can be adjusted using control parameter.

Parameters
  • page_area (tuple of float, optional) – tuple of coordinates of the BBox that sets the extraction area. Only text inside the provided page_area will be extracted. (default is None, which means the whole page area)

  • control (TextControl, optional) – A TextControl object, used to control the format of the extracted text. (default is None, which implies text will be extracted using default values from the TextControl class)

Returns

a ‘UTF-8’ decoded str object containing all the extracted text.

Return type

str

Note

This method is almost identical to text_bytes(); the only difference is that it decodes the extracted bytes as UTF-8 with the ‘ignore’ (codecs.ignore_errors()) error handler.

See also

TextOutput()

PDF to Text output device with caching support.
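
The page_area parameter takes an (x1, y1, x2, y2) tuple, so one way to build it is to pad a box returned by find_text() and clamp it to the page’s media box. The sketch below is pure geometry; pad_box is a hypothetical helper, and it assumes that find_text() results and page_area use the same coordinate system, which this reference does not confirm.

```python
def pad_box(box, margin, bounds):
    """Grow box by margin on every side, clamped to bounds.

    Hypothetical helper, not part of pyxpdf. Both box and bounds are
    (x1, y1, x2, y2) tuples with x1 <= x2 and y1 <= y2.
    """
    x1, y1, x2, y2 = box
    bx1, by1, bx2, by2 = bounds
    return (
        max(x1 - margin, bx1),
        max(y1 - margin, by1),
        min(x2 + margin, bx2),
        min(y2 + margin, by2),
    )


area = pad_box((100.0, 74.0, 117.0, 96.0), 10.0, (0.0, 0.0, 612.0, 792.0))
print(area)  # (90.0, 64.0, 127.0, 106.0)
```

The resulting tuple could then be passed as page_area to text() or text_bytes().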

text_bytes(self, page_area=None, TextControl control=None)

Parse and extract text bytes from current page.

Extracted text can be adjusted using the control parameter. This method should be used when the text encoding (Config.text_encoding) is different from UTF-8 or when you want to control the decoding of bytes yourself.

Parameters
  • page_area (tuple of float, optional) – tuple of coordinates of the BBox that sets the extraction area. Only text inside the provided page_area will be extracted. (default is None, which means the whole page area)

  • control (TextControl) – A TextControl object, used to control the format of the extracted text. (default is None, which implies text will be extracted using default values from the TextControl class)

Returns

a Config.text_encoding encoded bytes object containing all the extracted text.

Return type

bytes

See also

TextOutput()

PDF to Text output device with caching support.

trimbox

Page’s trim box coordinates

Type

tuple of float, (x1, y1, x2, y2)