TextOutput Device

The TextOutput output device uses a TextControl object to configure text extraction and layout analysis.

TextOutput

class pyxpdf.xpdf.TextOutput

PDF output device for text extraction and layout analysis

Extracts text from a PDF Document and performs layout analysis on it, caching the results. Page texts are cached for faster access and are loaded lazily: the text of a page is extracted only when it is first accessed.
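The lazy loading and caching behaviour can be pictured with a small sketch. The class below is purely illustrative and is not pyxpdf's implementation; `LazyPageTexts` and the `extract` callback are hypothetical names standing in for the real extraction step.

```python
from typing import Callable, Dict


class LazyPageTexts:
    """Illustrative cache: a page is extracted on first access only."""

    def __init__(self, extract: Callable[[int], str]) -> None:
        self._extract = extract            # hypothetical per-page extractor
        self._cache: Dict[int, str] = {}   # page index -> extracted text

    def get(self, page_no: int) -> str:
        # Run the (expensive) extraction only if the page is not cached yet.
        if page_no not in self._cache:
            self._cache[page_no] = self._extract(page_no)
        return self._cache[page_no]


calls = []


def fake_extract(page_no: int) -> str:
    calls.append(page_no)                  # record each real extraction
    return f"text of page {page_no}"


pages = LazyPageTexts(fake_extract)
first = pages.get(0)                       # triggers the extraction
second = pages.get(0)                      # served from the cache
```

After both calls, `calls` holds a single entry, since the second access never reaches the extractor.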

Parameters

doc (Document) – PDF Document for this output device
control (TextControl, optional) – A TextControl object with settings that adjust text extraction/analysis. (default is None)
kwargs – TextControl parameters which will be used if control is not provided.

doc

Parent PDF Document

Type: Document, readonly

control

Layout settings for output device

Type: TextControl

Raises: XPDFInternalError – If the internal xpdf objects cannot be initialized with the settings provided
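As a usage sketch, the two ways of supplying settings described under Parameters might look as follows. It assumes `doc` is an already opened Document and that TextControl parameters such as `mode` can be passed as keyword arguments; `build_outputs` is a hypothetical helper name, not part of pyxpdf.

```python
def build_outputs(doc):
    """Create two TextOutput devices for the same document.

    `doc` is assumed to be an already opened pyxpdf Document.
    """
    # Imported lazily so this sketch can be loaded even where pyxpdf
    # is not installed; the import runs when the helper is called.
    from pyxpdf.xpdf import TextControl, TextOutput

    # 1. Pass an explicit TextControl object.
    control = TextControl(mode="physical")
    with_control = TextOutput(doc, control=control)

    # 2. Pass TextControl parameters as keyword arguments instead;
    #    they are used because no control object is given.
    with_kwargs = TextOutput(doc, mode="physical")

    return with_control, with_kwargs
```

Either device can then be queried with get(), get_all(), or get_bytes().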

get(self, int page_no)

Get the extracted text of the page at index page_no as a UTF-8 decoded str

This method is similar to get_bytes(); the only difference is that it decodes the extracted bytes as UTF-8 using the ‘ignore’ (codecs.ignore_errors()) decoding error handler.

Parameters: page_no (int) – index of the page to extract text from
Returns: extracted UTF-8 decoded text
Return type: str
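To make the effect of the ‘ignore’ error handler concrete, the standard-library snippet below decodes a byte string that contains one invalid UTF-8 byte. The sample bytes are made up for illustration and do not come from a real PDF.

```python
# b"\xc3\xa9" is the UTF-8 encoding of "é"; b"\xff" is never valid in UTF-8.
raw = b"caf\xc3\xa9 \xff menu"

# With errors="ignore" the invalid byte is silently dropped.
text = raw.decode("utf-8", errors="ignore")

# The default (strict) handler raises instead of dropping the byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Because undecodable bytes disappear without any warning, get_bytes() is the safer choice whenever the configured text encoding is not UTF-8.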

get_all(self) → list

Get the extracted UTF-8 decoded text from all pages

Returns: list of UTF-8 decoded text from all the pages
Return type: list of str

get_bytes(self, int page_no) → bytes

Get the extracted text bytes of the page at index page_no

This method should be used when the text encoding (Config.text_encoding) is something other than UTF-8, or when you want to control the decoding of the bytes yourself.

Parameters: page_no (int) – index of page to extract text bytes from
Returns: extracted text bytes
Return type: bytes
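A minimal sketch of decoding the bytes yourself, assuming the text encoding was configured as Latin-1. The `raw` value is a stand-in for what get_bytes() would return, and `decode_page` is a hypothetical helper, not part of pyxpdf.

```python
def decode_page(raw: bytes, encoding: str = "latin-1",
                errors: str = "replace") -> str:
    """Decode extracted page bytes with an explicit encoding and error policy."""
    return raw.decode(encoding, errors=errors)


# Stand-in for the result of get_bytes(page_no) with a Latin-1 text encoding.
raw = "naïve résumé".encode("latin-1")

# Decoding with the matching encoding recovers the text.
text = decode_page(raw)

# Decoding the same bytes as UTF-8 with "replace" marks each bad byte
# with U+FFFD instead of dropping it silently.
lossy = decode_page(raw, encoding="utf-8")
```

Choosing "replace" over "ignore" keeps a visible marker wherever a byte could not be decoded, which makes an encoding mismatch easier to spot.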

TextControl

class pyxpdf.xpdf.TextControl

Parameters for Text extraction and layout analysis

Text layout modes:

reading
Keep the text in reading order. It undoes the physical layout (columns, hyphenation, etc.) and outputs the text in reading order.
physical
Maintain (as best as possible) the original physical layout of the text. If the fixed_pitch option is given, character spacing within each line will be determined by the specified character pitch.
table
It is similar to physical layout mode, but optimized for tabular data, with the goal of keeping rows and columns aligned (at the expense of inserting extra whitespace). If the fixed_pitch option is given, character spacing within each line will be determined by the specified character pitch.
simple
Similar to physical layout, but optimized for simple one-column pages. This mode will do a better job of maintaining horizontal spacing, but it will only work properly with a single column of text.
lineprinter
Line printer mode uses a strict fixed character pitch and height layout. That is, the page is broken into a grid, and characters are placed into that grid. If the grid spacing is too small for the actual characters, the result is extra whitespace. If the grid spacing is too large, the result is missing whitespace. The grid spacing can be specified using the fixed_pitch and fixed_line_spacing options. If one or both are not given, xpdf will attempt to compute appropriate value(s).
raw
Keep the text in content stream order. Depending on how the PDF file was generated, this may or may not be useful.
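The grid placement described for lineprinter mode can be sketched in a few lines. This is a simplified illustration, not xpdf's actual algorithm: the `(x, y, char)` input tuples, the `to_grid` function, and the rounding rule are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple


def to_grid(chars: Iterable[Tuple[float, float, str]],
            pitch: float, line_spacing: float) -> List[str]:
    """Place characters on a fixed grid.

    Each item is (x, y, char), with x measured from the left edge and
    y from the top edge of the page, both in points.
    """
    cells: Dict[Tuple[int, int], str] = {}
    for x, y, ch in chars:
        col = int(x // pitch)           # grid column from horizontal position
        row = int(y // line_spacing)    # grid row from vertical position
        cells[(row, col)] = ch          # simplification: later chars overwrite

    if not cells:
        return []

    n_rows = max(r for r, _ in cells) + 1
    lines = []
    for r in range(n_rows):
        cols = [c for (rr, c) in cells if rr == r]
        width = max(cols) + 1 if cols else 0
        # Empty cells become spaces, which is where extra whitespace comes from.
        lines.append("".join(cells.get((r, c), " ") for c in range(width)))
    return lines


# Characters 12 pt apart placed on a 6 pt grid: every other cell stays empty.
sample = [(0.0, 0.0, "a"), (12.0, 0.0, "b"), (24.0, 0.0, "c"), (0.0, 12.0, "d")]
too_small = to_grid(sample, pitch=6.0, line_spacing=12.0)
matching = to_grid(sample, pitch=12.0, line_spacing=12.0)
```

With a pitch smaller than the real character spacing, `too_small` contains padding between the letters, while `matching` packs them tightly.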

Parameters

mode ({"reading", "table", "simple", "physical", "lineprinter", "raw"}) – text analysis/extraction layout mode
fixed_pitch (float, optional) – Specify the character pitch (character width) for physical, table, or lineprinter mode. This is ignored in all other modes. (default is 0, which means an approximate character pitch will be calculated)
fixed_line_spacing (float, optional) – Specify the line spacing, in points, for lineprinter mode. This is ignored in all other modes. (default is 0, which means an approximate line spacing will be calculated)
enable_html (bool, optional) – enable extra processing for HTML. (default is False)
clip_text (bool, optional) – Text which is hidden because of clipping is removed before doing layout, and then added back in. This can be helpful for tables where clipped (invisible) text would overlap the next column. (default is False)
discard_clipped (bool, optional) – discard all clipped characters (default is False)
discard_diagonal (bool, optional) – Diagonal text, i.e., text that is not close to one of the 0, 90, 180, or 270 degree axes, is discarded. This is useful to skip watermarks drawn on top of body text, etc. (default is False)
discard_invisible (bool, optional) – discard all invisible characters (default is False)
insert_bom (bool, optional) – Insert a Unicode byte order marker (BOM) at the start of the text output.
margin_left (float, optional) – Specifies the left margin. Text in the left margin (i.e., within that many points of the left edge of the page) is discarded. (default is 0)
margin_right (float, optional) – Specifies the right margin. Text in the right margin (i.e., within that many points of the right edge of the page) is discarded. (default is 0)
margin_top (float, optional) – Specifies the top margin. Text in the top margin (i.e., within that many points of the top edge of the page) is discarded. (default is 0)
margin_bottom (float, optional) – Specifies the bottom margin. Text in the bottom margin (i.e., within that many points of the bottom edge of the page) is discarded. (default is 0)

Raises

ValueError – If mode is invalid
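A usage sketch combining several of the parameters above. It assumes `doc` is an already opened Document; `extract_table_text` is a hypothetical helper name, and the chosen values are arbitrary.

```python
def extract_table_text(doc, header_height=36.0):
    """Return the text of every page, laid out for tabular data.

    `doc` is assumed to be an already opened pyxpdf Document.
    """
    # Imported lazily so this sketch can be loaded even where pyxpdf
    # is not installed; the import runs when the helper is called.
    from pyxpdf.xpdf import TextControl, TextOutput

    control = TextControl(
        mode="table",              # keep rows and columns aligned
        clip_text=True,            # keep clipped text out of the next column
        discard_diagonal=True,     # skip watermarks and other diagonal text
        margin_top=header_height,  # discard text this close to the top edge
    )
    output = TextOutput(doc, control=control)
    return output.get_all()        # list of str with the text of all pages
```

An unsupported mode string would be rejected with the ValueError documented above, so it can be worth validating user-supplied modes before building the control.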
