Getting started =============== ``pyxpdf`` is a wrapper on `xpdf reader `_ sources. It aims to provide a fast and memory efficient pdf parser with easy to use API. Installation ------------ .. code-block:: bash pip install pyxpdf For additional encodings support, install optional dependency `pyxpdf_data `_ .. code-block:: bash pip install pyxpdf_data For Image extraction and pdf to image support, install optional dependency `Pillow `_ .. code-block:: bash pip install Pillow Quick Start ----------- ``pyxpdf`` use :class:`~pyxpdf.xpdf.Document` to represent and load a PDF file. Similary :class:`~pyxpdf.xpdf.Page` for PDF Page. All the xpdf related settings can be accessed with global :data:`~pyxpdf.xpdf.Config` object. .. code-block:: python from pyxpdf import Document, Page, Config from pyxpdf.xpdf import TextControl doc = Document("samples/nonfree/mandarin.pdf") # or # load pdf from file like object with open("samples/nonfree/mandarin.pdf", 'rb') as fp: doc = Document(fp) # get pdf metadata dict print(doc.info()) # >>> doc.info() # {'CreationDate': "D:20080721141207-04'00'", # 'Subject': 'Chinese Version of Universal PCXR8 ...', # 'Author': 'SKC Inc.', # 'Creator': 'PScript5.dll # ..... # get all text all_text = doc.text() # iter first 10 pages for page in doc[:10]: # get page label if any print(page.label) # get page by page label label_page = doc['1'] # get text in table layout without discarding clipped # text. text_control = TextControl("table", discard_clipped=True) text = label_page.text(control=text_control) # find case sensitive text within [x_min, y_min, x_max, y_max] res_box = label_page.find_text('操作说明', search_box=[0, 0, 400, 400], case_sensitive=True) # >>> print(res_box) # (281.88, 269.718, 354.05819999999994, 287.7) # load xpdfrc Config.load_file('my_xpdfrc') # suppress stderr output for xpdf error log. Config.error_quiet = False Checkout :doc:`api/index` for more details. .. todo:: Add bechmark and speed comparison with python pdf modules