TextOutput Device

The TextOutput output device uses a TextControl object to configure text extraction and layout analysis.

TextOutput

class pyxpdf.xpdf.TextOutput

PDF output device for text extraction and layout analysis

Extracts text from a PDF Document and performs layout analysis on it, caching the results. Page texts are loaded lazily: the text of a page is extracted the first time you access it and is then cached for faster access.

Parameters
  • doc (Document) – PDF Document for this output device

  • control (TextControl, optional) – A TextControl object holding the settings for text extraction/analysis. (default is None)

  • kwargs – TextControl parameters, which will be used if control is not provided.

doc

Parent PDF Document

Type

Document, readonly

control

Layout settings for output device

Type

TextControl

Raises

XPDFInternalError – If the internal xpdf objects cannot be initialized with the settings provided
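A minimal usage sketch of the constructor described above. It assumes that Document can be imported from the top-level pyxpdf package and accepts a file path, which is not stated in this section; the helper name extract_all_text is hypothetical.

```python
def extract_all_text(pdf_path, mode="reading", **control_options):
    """Return the text of every page of pdf_path as a list of str.

    Hypothetical helper, not part of pyxpdf.
    """
    # Imported lazily so the sketch can be loaded without pyxpdf installed.
    # Assumption: Document lives in the top-level package and takes a path.
    from pyxpdf import Document
    from pyxpdf.xpdf import TextControl, TextOutput

    doc = Document(pdf_path)
    # Build the layout settings explicitly and hand them to the device.
    control = TextControl(mode=mode, **control_options)
    text_out = TextOutput(doc, control=control)
    # Pages are extracted lazily and cached by the device.
    return text_out.get_all()


# Example call (needs pyxpdf and a real PDF file):
# pages = extract_all_text("report.pdf", mode="table", fixed_pitch=6.0)
```

Since the TextControl parameters can also be passed as keyword arguments when control is omitted, TextOutput(doc, mode="table") should be an equivalent shorter form.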

get(self, int page_no)

Get the text of the page at index page_no as a UTF-8 decoded str.

This method is similar to get_bytes(); the only difference is that it decodes the extracted bytes as UTF-8 using the ‘ignore’ (codecs.ignore_errors()) decoding error handler, so bytes that are not valid UTF-8 are dropped.

Parameters

page_no (int) – index of the page to extract text from

Returns

extracted UTF-8 decoded text

Return type

str
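To illustrate what the ‘ignore’ error handler does, here is a small standalone sketch using only the Python standard library. The byte string is a made-up stand-in for the bytes of one extracted page, not real pyxpdf output.

```python
# Stand-in for the raw bytes of one page; \xff is never valid in UTF-8.
raw = b"na\xc3\xafve\xff text"

# With 'ignore', undecodable bytes are silently dropped instead of raising.
text = raw.decode("utf-8", errors="ignore")
print(text)

# The default 'strict' handler would raise on the same input.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Because invalid bytes disappear without warning, get() is a convenient default for UTF-8 output, but it can hide encoding problems.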

get_all(self) → list

Get the extracted UTF-8 decoded text from all pages

Returns

list of UTF-8 decoded text from all the pages

Return type

list of str

get_bytes(self, int page_no) → bytes

Get the extracted text bytes of the page at index page_no.

This method should be used when the text encoding (Config.text_encoding) is something other than UTF-8, or when you want to control the decoding of the bytes yourself.

Parameters

page_no (int) – index of page to extract text bytes from

Returns

extracted text bytes

Return type

bytes
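As a rough sketch of decoding the bytes yourself, the following uses a made-up Latin-1 byte string in place of a real get_bytes() result. The helper decode_page_bytes is hypothetical, and the encoding name you pass must match whatever Config.text_encoding was set to.

```python
def decode_page_bytes(raw: bytes, encoding: str, errors: str = "strict") -> str:
    """Decode page bytes with an explicit encoding and error handler.

    Hypothetical helper; raw stands in for TextOutput.get_bytes(page_no).
    """
    return raw.decode(encoding, errors)


# Stand-in for a page extracted with a Latin-1 text encoding.
latin1_page = b"r\xe9sum\xe9"

# Decoding with the matching encoding keeps every character.
correct = decode_page_bytes(latin1_page, "latin-1")

# Decoding the same bytes as UTF-8 with 'ignore' loses the accented letters.
lossy = decode_page_bytes(latin1_page, "utf-8", errors="ignore")

print(correct)
print(lossy)
```

The second result shows why get() is a poor fit when the output encoding is not UTF-8.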

TextControl

class pyxpdf.xpdf.TextControl

Parameters for Text extraction and layout analysis

Text layout modes:
  • reading

    Keep the text in reading order. It ‘undoes’ the physical layout (columns, hyphenation, etc.) and outputs the text in reading order.

  • physical

    Maintain (as best as possible) the original physical layout of the text. If the fixed_pitch option is given, character spacing within each line will be determined by the specified character pitch.

  • table

    It is similar to physical layout mode, but optimized for tabular data, with the goal of keeping rows and columns aligned (at the expense of inserting extra whitespace). If the fixed_pitch option is given, character spacing within each line will be determined by the specified character pitch.

  • simple

    Similar to physical layout, but optimized for simple one-column pages. This mode will do a better job of maintaining horizontal spacing, but it will only work properly with a single column of text.

  • lineprinter

    Line printer mode uses a strict fixed character pitch and height layout. That is, the page is broken into a grid, and characters are placed into that grid. If the grid spacing is too small for the actual characters, the result is extra whitespace. If the grid spacing is too large, the result is missing whitespace. The grid spacing can be specified using the fixed_pitch and fixed_line_spacing options. If one or both are not given, xpdf will attempt to compute appropriate value(s).

  • raw

    Keep the text in content stream order. Depending on how the PDF file was generated, this may or may not be useful.
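The grid idea behind lineprinter mode can be pictured with a toy model. This is only an illustration of placing characters into a fixed grid, not xpdf's actual algorithm; the function name, the (x, y, char) input format, and the top-down y coordinate are all assumptions made for the sketch.

```python
def lineprinter_layout(chars, pitch, line_spacing):
    """Place (x, y, char) tuples into a fixed grid and return text lines.

    Toy model: x and y are in points, with y measured down from the page top.
    """
    rows = {}
    for x, y, ch in chars:
        row = int(y // line_spacing)  # grid row from the line spacing
        col = int(x // pitch)         # grid column from the character pitch
        rows.setdefault(row, {})[col] = ch

    lines = []
    for row in range(max(rows) + 1 if rows else 0):
        cells = rows.get(row, {})
        width = max(cells) + 1 if cells else 0
        # Empty grid cells become spaces.
        lines.append("".join(cells.get(c, " ") for c in range(width)))
    return lines


# Two lines of 6-point-wide characters, 12 points apart.
sample = [(0.0, 0.0, "H"), (6.0, 0.0, "i"), (0.0, 12.0, "o"), (6.0, 12.0, "k")]

fitting = lineprinter_layout(sample, pitch=6.0, line_spacing=12.0)
too_small = lineprinter_layout(sample, pitch=3.0, line_spacing=12.0)
print(fitting)
print(too_small)
```

With a pitch that matches the glyph width the words come out intact, while the too-small pitch leaves an empty cell between neighbouring characters, which is the extra whitespace described for lineprinter mode.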

Parameters
  • mode ({"reading", "table", "simple", "physical", "lineprinter", "raw"}) – text analysis/extraction layout mode

  • fixed_pitch (float, optional) – Specify the character pitch (character width) for physical, table, or lineprinter mode. This is ignored in all other modes. (default is 0, which means an approximate character pitch will be calculated)

  • fixed_line_spacing (float, optional) – Specify the line spacing, in points, for lineprinter mode. This is ignored in all other modes. (default is 0, which means an approximate line spacing will be calculated)

  • enable_html (bool, optional) – enable extra processing for HTML. (default is False)

  • clip_text (bool, optional) – Text which is hidden because of clipping is removed before doing layout, and then added back in. This can be helpful for tables where clipped (invisible) text would overlap the next column. (default is False)

  • discard_clipped (bool, optional) – discard all clipped characters (default is False)

  • discard_diagonal (bool, optional) – Diagonal text, i.e., text that is not close to one of the 0, 90, 180, or 270 degree axes, is discarded. This is useful to skip watermarks drawn on top of body text, etc. (default is False)

  • discard_invisible (bool, optional) – discard all invisible characters (default is False)

  • insert_bom (bool, optional) – Insert a Unicode byte order marker (BOM) at the start of the text output.

  • margin_left (float, optional) – Specifies the left margin. Text in the left margin (i.e., within that many points of the left edge of the page) is discarded. (default is 0)

  • margin_right (float, optional) – Specifies the right margin. Text in the right margin (i.e., within that many points of the right edge of the page) is discarded. (default is 0)

  • margin_top (float, optional) – Specifies the top margin. Text in the top margin (i.e., within that many points of the top edge of the page) is discarded. (default is 0)

  • margin_bottom (float, optional) – Specifies the bottom margin. Text in the bottom margin (i.e., within that many points of the bottom edge of the page) is discarded. (default is 0)

Raises

ValueError – If mode is invalid
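The margin and discard_diagonal parameters describe simple geometric filters. The sketch below models them in plain Python to make the rules concrete. It is not xpdf's implementation: the function names, the top-down y coordinate, and the 5 degree tolerance are assumptions, since the actual threshold for "close to an axis" is not given here.

```python
def in_margins(x, y, page_width, page_height,
               left=0.0, right=0.0, top=0.0, bottom=0.0):
    """True if point (x, y) lies inside any margin strip.

    Coordinates are in points, with y measured down from the top edge.
    """
    return (x < left
            or x > page_width - right
            or y < top
            or y > page_height - bottom)


def is_diagonal(angle_degrees, tolerance=5.0):
    """True if the angle is not close to 0, 90, 180 or 270 degrees.

    The tolerance is a hypothetical value chosen for this sketch.
    """
    offset = angle_degrees % 90.0
    return min(offset, 90.0 - offset) > tolerance


# A US Letter page (612 x 792 points) with half-inch margins all round.
margins = dict(left=36.0, right=36.0, top=36.0, bottom=36.0)
print(in_margins(10.0, 400.0, 612.0, 792.0, **margins))   # inside left margin
print(in_margins(100.0, 400.0, 612.0, 792.0, **margins))  # body text
print(is_diagonal(45.0))   # typical watermark angle
print(is_diagonal(91.0))   # nearly vertical text
```

Text for which either check is true would be dropped before layout when the corresponding option is set.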