TextOutput Device
The TextOutput output device uses a TextControl object to configure text extraction and layout analysis.
TextOutput
-
class pyxpdf.xpdf.TextOutput
PDF output device for text extraction and layout analysis.
Extracts text and performs layout analysis on a PDF Document, caching the results. Page texts are lazy loaded: each page's text is extracted the first time you access it and is then cached for faster access.
- Parameters
doc (Document) – PDF Document for this output device.
control (TextControl, optional) – A TextControl object with settings that adjust text extraction/analysis. (default is None)
kwargs – TextControl parameters, used if control is not provided.
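The two ways of configuring the device described above can be sketched as follows. This is a minimal illustration, assuming pyxpdf is installed and that `doc` is an already opened Document; the helper name `make_text_outputs` is hypothetical, and only the documented `control` and keyword parameters are used.

```python
def make_text_outputs(doc):
    """Build two TextOutput devices for an already opened pyxpdf Document."""
    # Imported lazily so this sketch can be loaded without pyxpdf installed.
    from pyxpdf.xpdf import TextControl, TextOutput

    # Option 1: pass an explicit TextControl object.
    control = TextControl(mode="table")
    explicit = TextOutput(doc, control=control)

    # Option 2: pass TextControl parameters as keyword arguments;
    # they are used because no control object is given.
    via_kwargs = TextOutput(doc, mode="table")
    return explicit, via_kwargs
```

Both devices should end up with equivalent layout settings; an explicit TextControl is mainly convenient when the same settings are reused across several documents.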
-
control
Layout settings for the output device
- Type
- Raises
XPDFInternalError – If the internal xpdf objects cannot be initialized with the settings provided
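The lazy loading and caching behaviour described for page texts can be pictured with a small stand-alone sketch. This is not pyxpdf's implementation; `LazyPageTexts` and `fake_extract` are hypothetical names used only to show the idea of extracting on first access and reusing the result afterwards.

```python
from typing import Callable, Dict


class LazyPageTexts:
    """Hypothetical cache: extracts a page's text only on first access."""

    def __init__(self, extract: Callable[[int], str]) -> None:
        self._extract = extract            # expensive per-page extraction function
        self._cache: Dict[int, str] = {}   # page_no -> extracted text

    def get(self, page_no: int) -> str:
        # Run the extraction only if this page has not been requested before.
        if page_no not in self._cache:
            self._cache[page_no] = self._extract(page_no)
        return self._cache[page_no]


calls = []  # records which pages were actually extracted


def fake_extract(page_no: int) -> str:
    calls.append(page_no)
    return f"text of page {page_no}"


pages = LazyPageTexts(fake_extract)
first = pages.get(0)    # triggers extraction
second = pages.get(0)   # served from the cache
```

After the two calls, `calls` contains a single entry, which shows that the second access did not repeat the extraction.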
-
get(self, int page_no)
Get the extracted text of the page at index page_no as a UTF-8 decoded str.
This method is nearly identical to get_bytes(); the only difference is that it decodes the extracted bytes as UTF-8 using the 'ignore' error handler (codecs.ignore_errors()).
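To see what the 'ignore' error handler means in practice, the following stand-alone snippet decodes a byte string that contains one invalid UTF-8 byte. It uses only the Python standard library and does not call pyxpdf; the sample bytes are made up for illustration.

```python
# "café" encoded as UTF-8, followed by 0xFF, which is never valid in UTF-8.
raw = "caf\u00e9".encode("utf-8") + b"\xff"

# With the 'ignore' handler the invalid byte is silently dropped.
text = raw.decode("utf-8", errors="ignore")

# With the default 'strict' handler the same input raises an error.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Because undecodable bytes are dropped without warning, text whose encoding is not UTF-8 can lose characters when read this way.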
-
get_all(self) → list
Get the extracted UTF-8 decoded text from all pages.
- Returns
a list with the UTF-8 decoded text of each page
- Return type
list of str
-
get_bytes(self, int page_no) → bytes
Get the extracted text bytes of the page at index page_no.
Use this method when the text encoding (Config.text_encoding) is not UTF-8, or when you want to control the decoding of the bytes yourself.
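Decoding the bytes yourself might look like the sketch below. The helper `decode_page` and the stand-in class `FakeTextOutput` are hypothetical; the stand-in exposes only a `get_bytes(page_no)` method so the example runs without a PDF, and a real TextOutput object would be passed in its place.

```python
class FakeTextOutput:
    """Hypothetical stand-in exposing only get_bytes(page_no)."""

    def __init__(self, pages):
        self._pages = pages  # list of raw page bytes

    def get_bytes(self, page_no):
        return self._pages[page_no]


def decode_page(text_output, page_no, encoding, errors="strict"):
    """Fetch the raw page bytes and decode them with a caller-chosen codec."""
    return text_output.get_bytes(page_no).decode(encoding, errors=errors)


# Pretend the extracted text was produced in Latin-1 rather than UTF-8.
out = FakeTextOutput(["Gr\u00fc\u00dfe".encode("latin-1")])
text = decode_page(out, 0, "latin-1")
```

Using the 'strict' handler by default makes an encoding mismatch fail loudly instead of silently dropping characters.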
TextControl
-
class pyxpdf.xpdf.TextControl
Parameters for text extraction and layout analysis
- Text layout modes:
- reading
Keep the text in reading order. This mode undoes the physical layout (columns, hyphenation, etc.) and outputs the text in reading order.
- physical
Maintain (as best as possible) the original physical layout of the text. If the fixed_pitch option is given, character spacing within each line will be determined by the specified character pitch.
- table
It is similar to physical layout mode, but optimized for tabular data, with the goal of keeping rows and columns aligned (at the expense of inserting extra whitespace). If the fixed_pitch option is given, character spacing within each line will be determined by the specified character pitch.
- simple
Similar to physical layout, but optimized for simple one-column pages. This mode will do a better job of maintaining horizontal spacing, but it will only work properly with a single column of text.
- lineprinter
Line printer mode uses a strict fixed character pitch and height layout. That is, the page is broken into a grid, and characters are placed into that grid. If the grid spacing is too small for the actual characters, the result is extra whitespace. If the grid spacing is too large, the result is missing whitespace. The grid spacing can be specified using the fixed_pitch and fixed_line_spacing options. If one or both are not given, xpdf will attempt to compute appropriate value(s).
- raw
Keep the text in content stream order. Depending on how the PDF file was generated, this may or may not be useful.
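The grid idea behind lineprinter mode can be illustrated with a toy sketch. This is not xpdf's algorithm; the function `place_on_grid` and the `(x, y, ch)` input format (coordinates in points, with y measured downward from the top of the page) are assumptions made for the example.

```python
def place_on_grid(chars, fixed_pitch, fixed_line_spacing):
    """Place (x, y, ch) tuples into a character grid and return text lines."""
    rows = {}
    for x, y, ch in chars:
        col = int(x // fixed_pitch)          # grid column from horizontal position
        row = int(y // fixed_line_spacing)   # grid row from vertical position
        rows.setdefault(row, {})[col] = ch

    row_count = (max(rows) + 1) if rows else 0
    lines = []
    for row in range(row_count):
        cells = rows.get(row, {})
        width = (max(cells) + 1) if cells else 0
        # Empty grid cells become spaces.
        lines.append("".join(cells.get(c, " ") for c in range(width)))
    return lines


chars = [(0, 0, "H"), (6, 0, "i"), (24, 12, "!")]

# Pitch matches the 6-point character advance: characters sit side by side.
lines = place_on_grid(chars, fixed_pitch=6, fixed_line_spacing=12)

# Pitch smaller than the character advance: extra whitespace appears.
tight = place_on_grid(chars[:2], fixed_pitch=3, fixed_line_spacing=12)
```

With the matching pitch the first line reads "Hi", while the smaller pitch yields "H i", which mirrors the extra whitespace described above for a grid spacing that is too small.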
- Parameters
mode ({"reading", "table", "simple", "physical", "lineprinter", "raw"}) – text analysis/extraction layout mode
fixed_pitch (float, optional) – Specify the character pitch (character width), in points, for physical, table, or lineprinter mode. This is ignored in all other modes. (default is 0, meaning an approximate character pitch will be calculated)
fixed_line_spacing (float, optional) – Specify the line spacing, in points, for lineprinter mode. This is ignored in all other modes. (default is 0, meaning an approximate line spacing will be calculated)
enable_html (bool, optional) – Enable extra processing for HTML. (default is False)
clip_text (bool, optional) – Text which is hidden because of clipping is removed before doing layout, and then added back in. This can be helpful for tables where clipped (invisible) text would overlap the next column. (default is False)
discard_clipped (bool, optional) – Discard all clipped characters. (default is False)
discard_diagonal (bool, optional) – Diagonal text, i.e., text that is not close to one of the 0, 90, 180, or 270 degree axes, is discarded. This is useful to skip watermarks drawn on top of body text, etc. (default is False)
discard_invisible (bool, optional) – Discard all invisible characters. (default is False)
insert_bom (bool, optional) – Insert a Unicode byte order marker (BOM) at the start of the text output.
margin_left (float, optional) – Specifies the left margin. Text in the left margin (i.e., within that many points of the left edge of the page) is discarded. (default is 0)
margin_right (float, optional) – Specifies the right margin. Text in the right margin (i.e., within that many points of the right edge of the page) is discarded. (default is 0)
margin_top (float, optional) – Specifies the top margin. Text in the top margin (i.e., within that many points of the top edge of the page) is discarded. (default is 0)
margin_bottom (float, optional) – Specifies the bottom margin. Text in the bottom margin (i.e., within that many points of the bottom edge of the page) is discarded. (default is 0)
- Raises
ValueError – If mode is invalid
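The margin and discard_diagonal parameters can be pictured with two small predicates. This is only a sketch of the described behaviour, not xpdf's code: the function names, the coordinate convention (points, with y measured downward from the top edge), and the 5 degree tolerance for "close to an axis" are all assumptions.

```python
def in_margin(x, y, page_width, page_height, left=0, right=0, top=0, bottom=0):
    """True if the point lies within any margin, i.e. its text would be discarded."""
    return (
        x < left
        or x > page_width - right
        or y < top
        or y > page_height - bottom
    )


def is_diagonal(angle_degrees, tolerance=5.0):
    """True if the angle is not within `tolerance` of 0, 90, 180 or 270 degrees."""
    offset = angle_degrees % 90.0
    # Distance to the nearest multiple of 90 degrees.
    return min(offset, 90.0 - offset) > tolerance


# US Letter page (612 x 792 points) with a half-inch left margin.
dropped = in_margin(10, 100, 612, 792, left=36)
kept = in_margin(100, 100, 612, 792, left=36)

watermark = is_diagonal(45)   # typical diagonal watermark
sideways = is_diagonal(90)    # rotated but axis-aligned text
```

With all margins left at their default of 0, `in_margin` returns False for every point on the page, matching the documented default of discarding nothing.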