Document and Page¶
Document¶
-
class
pyxpdf.xpdf.
Document
(pdf, ownerpass=None, userpass=None)¶ This class represents a PDF Document.
Page
objects can be accessed through indexing or slicing. If the pdf parameter is a file-like object, make sure it is opened in binary (‘b’) mode, because Document does not check for it.
Examples
>>> doc = Document("~/sample.pdf")
Total pages in Document
>>> len(doc)
17
Access page by index
>>> page1 = doc[0]
>>> page1
<Page[0]>
Access page by label
>>> cover_page = doc['Cover1']
>>> cover_page
<Page[0](label='Cover1')>
Get pages slice (all even number pages)
>>> even_pages = doc[0::2]
>>> even_pages
[<Page[0]>, <Page[2]>, <Page[4]>, ...]
Iterate over pages
>>> for page in doc:
...     print(page)
<Page[0]>
<Page[1]>
<Page[2]>
...
- Parameters
- Raises
PDFPermissionError – If failed to decrypt encrypted PDF using given passwords
PDFIOError – If failed to load file from given file path
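Because Document does not verify the mode of a file-like object, a caller may want to check it before constructing the Document. The following is a minimal sketch and not part of pyxpdf: `is_binary_stream` and `load_document` are hypothetical helper names, and the import path `pyxpdf.xpdf` is taken from the class path shown above.

```python
import io


def is_binary_stream(fobj) -> bool:
    """Return True if reading from fobj yields bytes (binary mode)."""
    # A zero-length read returns an empty bytes or str object
    # without consuming any data from the stream.
    return isinstance(fobj.read(0), bytes)


def load_document(fobj):
    """Hypothetical helper: build a Document only from a binary stream."""
    if not is_binary_stream(fobj):
        raise ValueError("file-like object must be opened in binary ('b') mode")
    # Imported lazily so the check above works without pyxpdf installed.
    from pyxpdf.xpdf import Document
    return Document(fobj)


print(is_binary_stream(io.BytesIO(b"%PDF-1.4")))  # True
print(is_binary_stream(io.StringIO("text")))      # False
```

With a real file, this would be used as `load_document(open(path, "rb"))`, where `path` is whatever PDF path the caller has.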
-
filename
¶ Name of the file from which the PDF document was loaded.
If the PDF was loaded from a file-like object, it will be an empty str.
- Type
-
info
(self)¶ Get the PDF’s info dictionary.
PDF info dictionary contains keys such as Author, Creator, ModDate, etc.
- Returns
PDF’s information dictionary.
- Return type
-
is_encrypted
¶ whether pdf is encrypted or not
Warning
Due to a bug in the xpdf sources, even non-encrypted PDF documents sometimes return
True
- Type
-
ok_to_change
¶ PDF change permission
- Whether PDF can be modified or not. Modifications include:
Inserting, Deleting, Rotating pages.
Commenting, filling in form fields, and signing existing signature fields.
- Type
-
ok_to_copy
¶ PDF copy permission.
Whether pdf content can be copied or not.
Note
PDF copy permission is required for extraction of text and images from document.
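A caller can consult the copy permission before attempting extraction. The sketch below is illustrative only: `extract_text_if_allowed` is a hypothetical helper, and `FakeDocument` is a stand-in so the example runs without pyxpdf or a PDF file. Whether the library itself refuses extraction when the permission is missing is not stated here, so the explicit check is an assumption about how a caller might want to behave.

```python
def extract_text_if_allowed(doc):
    """Return doc.text() when copying is permitted, otherwise None."""
    # ok_to_copy and text() are the Document members described in this section.
    if not doc.ok_to_copy:
        return None
    return doc.text()


class FakeDocument:
    """Stand-in for Document, used only to exercise the helper."""

    def __init__(self, ok_to_copy, content):
        self.ok_to_copy = ok_to_copy
        self._content = content

    def text(self, start=0, end=-1, control=None):
        return self._content


print(extract_text_if_allowed(FakeDocument(True, "hello")))   # hello
print(extract_text_if_allowed(FakeDocument(False, "hello")))  # None
```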
- Type
-
text
(self, start=0, end=-1, control=None)¶ Parse and extract UTF-8 decoded text from given page range.
Extracted text can be adjusted using control parameter.
- Parameters
start (int) – index of first page to extract
end (int) – index of last page to extract
control (TextControl, optional) – A TextControl object used to control the format of the extracted text (default is None, which means text is extracted using the default values of the TextControl class)
- Returns
a ‘UTF-8’ decoded str object containing all the extracted text.
- Return type
Note
This method is nearly identical to text_bytes(); the only difference is that it decodes the extracted bytes as UTF-8 with the ‘ignore’ decoding error handler (codecs.ignore_errors()).
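The effect of the ‘ignore’ handler can be seen with plain Python bytes. This is a small sketch using only the standard library: the byte string stands in for what text_bytes() might return, `decode_extracted` is a hypothetical helper, and UTF-8 is assumed as the text encoding.

```python
def decode_extracted(raw: bytes, encoding: str = "utf-8", errors: str = "ignore") -> str:
    """Decode extracted bytes with a chosen codec error handler."""
    return raw.decode(encoding, errors)


# Valid UTF-8 for "café" followed by one invalid byte (0xFF).
raw = b"caf\xc3\xa9\xff"

print(decode_extracted(raw))                    # invalid byte is dropped
print(decode_extracted(raw, errors="replace"))  # invalid byte becomes U+FFFD

try:
    decode_extracted(raw, errors="strict")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)
```

Calling text_bytes() and decoding manually is the route to take when silently dropping undecodable bytes is not acceptable.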
-
text_bytes
(self, int start=0, int end=-1, TextControl control=None)¶ Parse and extract text from given page range.
Extracted text can be adjusted using control parameter. This method should be used when the text encoding (Config.text_encoding) is different from UTF-8, or when you want to control the decoding of bytes yourself.
- Parameters
start (int) – index of first page to extract
end (int) – index of last page to extract
control (TextControl) – A TextControl object used to control the format of the extracted text (default is None, which means text is extracted using the default values of the TextOutput class)
- Returns
a Config.text_encoding encoded bytes object containing all the extracted text.
- Return type
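For large documents it can be convenient to extract text a few pages at a time using the start and end parameters. The sketch below only computes the index pairs; `page_ranges` is a hypothetical helper, and it assumes that end is inclusive, based on the description "index of last page to extract".

```python
def page_ranges(page_count: int, chunk_size: int):
    """Yield (start, end) page index pairs, end inclusive, covering all pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, page_count, chunk_size):
        # Clamp the last chunk to the final page index.
        end = min(start + chunk_size, page_count) - 1
        yield start, end


# A 17-page document split into chunks of at most 5 pages.
print(list(page_ranges(17, 5)))  # [(0, 4), (5, 9), (10, 14), (15, 16)]
```

Each pair could then be passed along as `doc.text_bytes(start, end)` or `doc.text(start, end)`, with `len(doc)` supplying the page count.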
Page¶
-
class
pyxpdf.xpdf.
Page
(doc, index)¶ Represents a PDF page
Examples
>>> page1 = doc[1]
Page index and label (if any)
>>> page1.index
1
>>> page1.label
'Cover1'
Page BBox(s)
>>> page1.mediabox
(0.0, 0.0, 612.0, 792.0)
>>> page1.cropbox
(0.0, 0.0, 612.0, 792.0)
Find text location in Page
>>> page1.find_text("Hello")
(100.0, 74.768, 117.328, 96.968)
- Parameters
- Raises
IndexError – If index parameter is outside page range
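The box attributes are (x1, y1, x2, y2) tuples, so page dimensions follow from simple subtraction. A minimal sketch, with `bbox_size` as a hypothetical helper; it assumes the values are in default PDF user-space units (1/72 inch), which this page does not state explicitly.

```python
POINTS_PER_INCH = 72.0  # default PDF user-space unit is 1/72 inch


def bbox_size(bbox):
    """Return (width, height) of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = bbox
    return abs(x2 - x1), abs(y2 - y1)


# The mediabox value from the example above.
width, height = bbox_size((0.0, 0.0, 612.0, 792.0))
print(width, height)                                      # 612.0 792.0
print(width / POINTS_PER_INCH, height / POINTS_PER_INCH)  # 8.5 11.0
```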
-
artbox
¶ Page’s art box coordinates
- Type
tuple of float, (x1, y1, x2, y2)
-
bleedbox
¶ Page’s bleed box coordinates
- Type
tuple of float, (x1, y1, x2, y2)
-
cropbox
¶ Page’s crop box coordinates
- Type
tuple of float, (x1, y1, x2, y2)
-
find_all_text
(self, text, search_box=None, case_sensitive=False, wholeword=False, rotation=0)¶ Find the text and get all the matches.
Same as find_text(), but returns all the matches.
- Parameters
text (str) – Text to search in page
search_box (tuple of float, optional) – tuple of coordinates of the BBox that sets the search area (default is None, meaning the whole page area)
case_sensitive (bool, optional) – If False, match the text regardless of its case in the page (default is False)
wholeword (bool, optional) – match the text as a whole word only (default is False)
rotation (int, optional) – rotation of page (default is 0)
-
find_text
(self, text, search_box=None, direction='top', case_sensitive=False, wholeword=False, rotation=0)¶ Find the text in Page.
Search for the text in the given search_box (BBox) of the page. If wholeword is set, try to match the text as a whole word only.
If direction is ‘top’, start the search from the top of the page.
If direction is ‘next’, get the next match from the page.
If direction is ‘previous’, get the previous match from the page.
- Parameters
text (str) – Text to search in page
search_box (tuple of float, optional) – tuple of coordinates of the BBox that sets the search area (default is None, meaning the whole page area)
direction ({'top', 'next', 'previous'}) – style of search
case_sensitive (bool, optional) – If False, match the text regardless of its case in the page (default is False)
wholeword (bool, optional) – match the text as a whole word only (default is False)
rotation (int, optional) – rotation of page (default is 0)
- Returns
If a match is found, a tuple of coordinates (x1, y1, x2, y2) of the BBox of the text in the page; otherwise None
- Return type
tuple of float, None
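The direction values suggest a way to walk through matches one at a time: start with ‘top’, then keep asking for ‘next’. The sketch below illustrates that pattern. `iter_matches` is a hypothetical helper and `FakePage` is a stand-in that replays fixed boxes; it assumes that a real Page returns None from a ‘next’ search once no matches remain, which this page does not confirm. find_all_text() is the documented way to get every match in one call.

```python
def iter_matches(page, text, **options):
    """Yield each match BBox, using direction='top' and then 'next'."""
    match = page.find_text(text, direction="top", **options)
    while match is not None:
        yield match
        match = page.find_text(text, direction="next", **options)


class FakePage:
    """Stand-in for Page that replays a fixed list of match boxes."""

    def __init__(self, boxes):
        self._boxes = list(boxes)
        self._pos = -1

    def find_text(self, text, search_box=None, direction="top",
                  case_sensitive=False, wholeword=False, rotation=0):
        # Move a cursor over the stored boxes according to direction.
        if direction == "top":
            self._pos = 0
        elif direction == "next":
            self._pos += 1
        else:  # 'previous'
            self._pos -= 1
        if 0 <= self._pos < len(self._boxes):
            return self._boxes[self._pos]
        return None


boxes = [(100.0, 74.768, 117.328, 96.968), (100.0, 174.768, 117.328, 196.968)]
for bbox in iter_matches(FakePage(boxes), "Hello"):
    print(bbox)
```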
See also
-
mediabox
¶ Page’s media box coordinates
- Type
tuple of float, (x1, y1, x2, y2)
-
text
(self, page_area=None, control=None)¶ Parse and extract UTF-8 decoded text from current page.
Extracted text can be adjusted using control parameter.
- Parameters
page_area (tuple of float, optional) – tuple of coordinates of the BBox that sets the extraction area. Only text inside the provided page_area will be extracted. (default is None, meaning the whole page area)
control (TextControl, optional) – A TextControl object used to control the format of the extracted text (default is None, which means text is extracted using the default values of the TextControl class)
- Returns
a ‘UTF-8’ decoded str object containing all the extracted text.
- Return type
Note
This method is nearly identical to text_bytes(); the only difference is that it decodes the extracted bytes as UTF-8 with the ‘ignore’ decoding error handler (codecs.ignore_errors()).
See also
TextOutput()
PDF to Text output device with caching support.
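A page_area tuple can be derived from one of the page boxes, for example to skip headers and footers by trimming a margin. A minimal sketch: `inset_box` is a hypothetical helper, and it assumes x1 < x2 and y1 < y2 and that page_area uses the same coordinate space as the box it is derived from.

```python
def inset_box(bbox, margin):
    """Shrink an (x1, y1, x2, y2) box by margin on every side."""
    x1, y1, x2, y2 = bbox
    if margin < 0:
        raise ValueError("margin must not be negative")
    if 2 * margin >= min(x2 - x1, y2 - y1):
        raise ValueError("margin leaves no area inside the box")
    return (x1 + margin, y1 + margin, x2 - margin, y2 - margin)


# Half-inch (36 unit) margin inside a 612 x 792 box.
print(inset_box((0.0, 0.0, 612.0, 792.0), 36.0))  # (36.0, 36.0, 576.0, 756.0)
```

The result could be passed as `page.text(page_area=inset_box(page.mediabox, 36.0))`.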
-
text_bytes
(self, page_area=None, TextControl control=None)¶ Parse and extract text bytes from current page.
Extracted text can be adjusted using control parameter. This method should be used when the text encoding (Config.text_encoding) is different from UTF-8, or when you want to control the decoding of bytes yourself.
- Parameters
page_area (tuple of float, optional) – tuple of coordinates of the BBox that sets the extraction area. Only text inside the provided page_area will be extracted. (default is None, meaning the whole page area)
control (TextControl) – A TextControl object used to control the format of the extracted text (default is None, which means text is extracted using the default values of the TextOutput class)
- Returns
a Config.text_encoding encoded bytes object containing all the extracted text.
- Return type
See also
TextOutput()
PDF to Text output device with caching support.
-
trimbox
¶ Page’s trim box coordinates
- Type
tuple of float, (x1, y1, x2, y2)