Document and Page

Document

class pyxpdf.xpdf.Document(pdf, ownerpass=None, userpass=None)

This class represents a PDF Document.

Page objects can be accessed through indexing and slicing.

If the pdf parameter is a file-like object, make sure it is opened in binary (‘b’) mode, because Document does not check for this.
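Because Document does not verify the mode of a file-like object, the caller may want to check it before handing the object over. The following is a minimal sketch of such a check. The helper is_binary_stream is hypothetical (it is not part of pyxpdf) and assumes only that the object supports read().

```python
import io


def is_binary_stream(fileobj):
    """Return True if reading from fileobj yields bytes rather than str.

    Hypothetical helper, not part of pyxpdf. A zero-length read does not
    move the stream position, so the object can still be passed on.
    """
    return isinstance(fileobj.read(0), bytes)


# In-memory streams stand in for files opened with open(path, "rb")
# and open(path, "r") respectively.
print(is_binary_stream(io.BytesIO(b"%PDF-1.7")))  # binary stream
print(is_binary_stream(io.StringIO("%PDF-1.7")))  # text stream
```

A file opened with open(path, "rb") should pass this check, while one opened in text mode should not.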

Examples

>>> doc = Document("~/sample.pdf")

Total pages in Document

>>> len(doc)
17

Access page by index

>>> page1 = doc[0]
>>> page1
<Page[0]>

Access page by label

>>> cover_page = doc['Cover1']
>>> cover_page
<Page[0](label='Cover1')>

Get a slice of pages (every second page, starting at index 0)

>>> even_pages = doc[0::2]
>>> even_pages
[<Page[0]>, <Page[2]>, <Page[4]>, ...]

Iterate over pages

>>> for page in doc:
...     print(page)
<Page[0]>
<Page[1]>
<Page[2]>
...

Parameters

pdf (str or file-like) – Path of pdf file to load or a file-like object.
ownerpass (str, optional) – Owner password of pdf file, if encrypted (default None)
userpass (str, optional) – User password of pdf file, if encrypted (default None)

Raises

PDFPermissionError – If the encrypted PDF cannot be decrypted with the given passwords
PDFIOError – If the file cannot be loaded from the given file path

filename

Name of the file from which the PDF document was loaded.

If the PDF was loaded from a file-like object, this is an empty str.

Type: str

has_page_labels

Whether the PDF has page labels or not.

Type: bool

info(self)

Get the PDF’s info dictionary.

The PDF info dictionary contains keys such as Author, Creator, ModDate, etc.

Returns: The PDF’s information dictionary.
Return type: dict
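Since the info dictionary may or may not contain a given key, it can help to pick out only the entries that are present. The sketch below uses a plain dict with made-up values in place of the result of info(). The helper pick_info is hypothetical, and the Title key is assumed from the standard PDF info entries rather than taken from this page.

```python
def pick_info(info, keys=("Title", "Author", "Creator", "ModDate")):
    """Return only the requested entries that are present in info."""
    return {key: info[key] for key in keys if key in info}


# A plain dict stands in for the value returned by Document.info();
# the entries below are made-up sample data.
sample_info = {"Author": "Jane Doe", "Creator": "LaTeX", "Pages": "17"}
print(pick_info(sample_info))
```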

is_encrypted

Whether the PDF is encrypted or not.

Warning

Due to a bug in the xpdf sources, this sometimes returns True even for non-encrypted PDF documents.

Type: bool

is_linearized

Whether the PDF is linearized or not.

Type: bool

num_pages

Total number of pages in the PDF.

Type: int

ok_to_add_notes

PDF add notes permission.

Type: bool

ok_to_change

PDF change permission.

Whether the PDF can be modified or not. Modifications include:

Inserting, deleting, and rotating pages.
Commenting, filling in form fields, and signing existing signature fields.

Type: bool

ok_to_copy

PDF copy permission.

Whether the PDF content can be copied or not.

Note

PDF copy permission is required for extraction of text and images from the document.

Type: bool
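As the note says, extraction depends on the copy permission, so a caller may prefer to test ok_to_copy before asking for text. The following is a sketch under that assumption. The helper text_if_allowed is hypothetical, and the stand-in objects only mimic the two Document members used here. How pyxpdf itself reacts when the permission is missing is not covered by this sketch.

```python
from types import SimpleNamespace


def text_if_allowed(doc):
    """Return doc.text() when copying is permitted, otherwise None."""
    if not doc.ok_to_copy:
        return None
    return doc.text()


# Stand-in objects that mimic the two attributes used above; with pyxpdf
# installed, a real Document instance would be passed instead.
locked = SimpleNamespace(ok_to_copy=False, text=lambda: "secret")
unlocked = SimpleNamespace(ok_to_copy=True, text=lambda: "hello")
print(text_if_allowed(locked))
print(text_if_allowed(unlocked))
```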

ok_to_print

PDF print permission.

Whether the document can be printed or not.

Type: bool

pdf_version

Version of the PDF standard that the document complies with.

Type: float

text(self, start=0, end=-1, control=None)

Parse and extract UTF-8 decoded text from the given page range.

Extracted text can be adjusted using the control parameter.

Parameters

start (int) – index of first page to extract
end (int) – index of last page to extract
control (TextControl, optional) – A TextControl object, used to control the format of the extracted text. (default is None, which implies text will be extracted using default values from TextControl class)

Returns

a ‘UTF-8’ decoded str object containing all the extracted text.

Return type

str

Note

This method is nearly identical to text_bytes(). The only difference is that it decodes the extracted bytes as UTF-8 with the ‘ignore’ (codecs.ignore_errors()) error handler, so bytes that cannot be decoded are silently dropped.
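To illustrate what the ‘ignore’ handler does, the sketch below decodes a made-up byte string (not real pyxpdf output) containing one invalid byte, once with ‘ignore’ and once with ‘replace’. It uses only the standard bytes.decode() method.

```python
raw = b"caf\xc3\xa9 \xff ok"  # valid UTF-8 plus one stray 0xFF byte

ignored = raw.decode("utf-8", errors="ignore")    # drops the bad byte
replaced = raw.decode("utf-8", errors="replace")  # marks it with U+FFFD

print(repr(ignored))
print(repr(replaced))
```

If silently losing bytes is not acceptable, working from the raw bytes and choosing a different error handler, as in the second call, is one option.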

See also

Page.text()

TextOutput(): PDF to Text output device with caching support.

text_bytes(self, int start=0, int end=-1, TextControl control=None)

Parse and extract text from the given page range.

Extracted text can be adjusted using the control parameter. This method should be used when the text encoding (Config.text_encoding) is different from UTF-8 or when you want to control the decoding of bytes yourself.

Parameters

start (int) – index of first page to extract
end (int) – index of last page to extract
control (TextControl) – A TextControl object, used to control the format of the extracted text. (default is None, which implies text will be extracted using default values from TextOutput class)

Returns

a Config.text_encoding encoded bytes object containing all the extracted text.

Return type

bytes

See also

Page.text_bytes()

TextOutput(): PDF to Text output device with caching support.

xmp_metadata(self)

Get the PDF’s XMP metadata.

Returns: The PDF’s XMP metadata.
Return type: str

Page

class pyxpdf.xpdf.Page(doc, index)

Represents a PDF page.

Examples

>>> page1 = doc[1]

Page index and label (if any)

>>> page1.index
1
>>> page1.label
'Cover1'

Page BBox(s)

>>> page1.mediabox
(0.0, 0.0, 612.0, 792.0)
>>> page1.cropbox
(0.0, 0.0, 612.0, 792.0)

Find text location in Page

>>> page1.find_text("Hello")
(100.0, 74.768, 117.328, 96.968)

Parameters

doc (Document) – Parent pdf Document
index (int) – index of pdf Page

Raises

IndexError – If index parameter is outside page range

doc

Parent pdf document

Type: Document, readonly

index

Type: int, readonly

label

Type: str, readonly

artbox

Page’s art box coordinates

Type: tuple of float, (x1, y1, x2, y2)

bleedbox

Page’s bleed box coordinates

Type: tuple of float, (x1, y1, x2, y2)

crop_height

page cropbox height

Type: float

crop_width

page cropbox width

Type: float

cropbox

Page’s crop box coordinates

Type: tuple of float, (x1, y1, x2, y2)
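Each box is an (x1, y1, x2, y2) tuple, so a width and height can be derived from it directly. The sketch below does this for the sample mediabox value shown in the examples and converts the result to inches. It assumes the values are in default PDF user space units of 1/72 inch. That crop_width and crop_height equal these differences is also an assumption, not something this page states. The helper bbox_size is hypothetical.

```python
def bbox_size(bbox):
    """Return (width, height) of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = bbox
    return (abs(x2 - x1), abs(y2 - y1))


POINTS_PER_INCH = 72.0  # assumes the default PDF user space unit

width, height = bbox_size((0.0, 0.0, 612.0, 792.0))
print(width / POINTS_PER_INCH, height / POINTS_PER_INCH)
```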

find_all_text(self, text, search_box=None, case_sensitive=False, wholeword=False, rotation=0)

Find the text and get all the matches.

Same as find_text(), but returns all the matches.

Parameters

text (str) – Text to search in page
search_box (tuple of float, optional) – tuple of coordinates of the BBox to set the search area. (default is None, meaning the whole page area)
case_sensitive (bool, optional) – If False, match the text regardless of its case in page (default is False)
wholeword (bool, optional) – match the text as a whole word only. (default is False)
rotation (int, optional) – rotation of page (default is 0)

find_text(self, text, search_box=None, direction='top', case_sensitive=False, wholeword=False, rotation=0)

Find the text in Page.

Search for the text in the given search_box (BBox) of the page. If wholeword is True, match the text only as a whole word.

If direction is ‘top’, start the search from the top of the page.

If direction is ‘next’, get the next match from the page.

If direction is ‘previous’, get the previous match from the page.

Parameters

text (str) – Text to search in page
search_box (tuple of float, optional) – tuple of coordinates of the BBox to set the search area. (default is None, meaning the whole page area)
direction ({'top', 'next', 'previous'}, optional) – where the search starts or continues from (default is 'top')
case_sensitive (bool, optional) – If False, match the text regardless of its case in page (default is False)
wholeword (bool, optional) – match the text as a whole word only. (default is False)
rotation (int, optional) – rotation of page (default is 0)

Returns

If a match is found, a tuple of coordinates (x1, y1, x2, y2) of the BBox of the text in the page; otherwise None.

Return type

tuple of float, None

See also

find_all_text()
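The ‘top’ and ‘next’ directions can be combined to walk through matches one at a time. The following is a sketch of that pattern, assuming that ‘next’ continues from the previous match and that None signals no further match. The generator iter_matches and the FakePage stand-in are hypothetical and exist only so the sketch runs without a PDF. The second bounding box is made-up sample data.

```python
def iter_matches(page, text, limit=1000, **options):
    """Yield bounding boxes of successive matches of text on page."""
    bbox = page.find_text(text, direction="top", **options)
    count = 0
    # The limit guards against looping forever if a match keeps repeating.
    while bbox is not None and count < limit:
        yield bbox
        count += 1
        bbox = page.find_text(text, direction="next", **options)


class FakePage:
    """Stand-in for Page that replays a fixed list of matches."""

    def __init__(self, boxes):
        self._boxes = boxes
        self._pos = -1

    def find_text(self, text, direction="top", **options):
        self._pos = 0 if direction == "top" else self._pos + 1
        if self._pos < len(self._boxes):
            return self._boxes[self._pos]
        return None


page = FakePage([(100.0, 74.768, 117.328, 96.968), (100.0, 120.0, 117.328, 142.2)])
print(list(iter_matches(page, "Hello")))
```

With a real Page, extra keyword arguments such as wholeword=True would be forwarded to find_text() unchanged.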

is_cropped

whether page is cropped or not

Type: bool

media_height

page mediabox height

Type: float

media_width

page mediabox width

Type: float

mediabox

Page’s media box coordinates

Type: tuple of float, (x1, y1, x2, y2)

rotation

page rotation in degrees

Type: int

text(self, page_area=None, control=None)

Parse and extract UTF-8 decoded text from the current page.

Extracted text can be adjusted using the control parameter.

Parameters

page_area (tuple of float, optional) – tuple of coordinates of the BBox to set the extraction area. Only text inside the provided page_area will be extracted. (default is None, meaning the whole page area)
control (TextControl, optional) – A TextControl object, used to control the format of the extracted text. (default is None, which implies text will be extracted using default values from TextControl class)

Returns

a ‘UTF-8’ decoded str object containing all the extracted text.

Return type

str
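A page_area tuple can be derived from one of the page boxes. The sketch below computes the left half of a box, using the sample mediabox value from the examples. The helper left_half is hypothetical, and it is an assumption that page_area uses the same coordinate space as mediabox.

```python
def left_half(bbox):
    """Return the left half of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = bbox
    return (x1, y1, (x1 + x2) / 2.0, y2)


area = left_half((0.0, 0.0, 612.0, 792.0))
print(area)
# With pyxpdf available, the tuple could then be passed on, e.g.
# page.text(page_area=area)
```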

Note

This method is nearly identical to text_bytes(). The only difference is that it decodes the extracted bytes as UTF-8 with the ‘ignore’ (codecs.ignore_errors()) error handler, so bytes that cannot be decoded are silently dropped.

See also

TextOutput(): PDF to Text output device with caching support.

text_bytes(self, page_area=None, TextControl control=None)

Parse and extract text bytes from the current page.

Extracted text can be adjusted using the control parameter. This method should be used when the text encoding (Config.text_encoding) is different from UTF-8 or when you want to control the decoding of bytes yourself.

Parameters

page_area (tuple of float, optional) – tuple of coordinates of the BBox to set the extraction area. Only text inside the provided page_area will be extracted. (default is None, meaning the whole page area)
control (TextControl) – A TextControl object, used to control the format of the extracted text. (default is None, which implies text will be extracted using default values from TextOutput class)

Returns

a Config.text_encoding encoded bytes object containing all the extracted text.

Return type

bytes

See also

TextOutput(): PDF to Text output device with caching support.

trimbox

Page’s trim box coordinates

Type: tuple of float, (x1, y1, x2, y2)
