Speed Comparison

Thanks to the brilliant xpdf reader sources and the fact that pyxpdf is written in Cython as a Python C-API module, it is much faster than pure-Python PDF parsers.

Text Extraction

This compares the speed of text extraction (while maintaining layout) against the popular pdfminer.six module. (Python script used: compare.py)

Test environment: Python 3.6.9, gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, Ubuntu 18.04, on Azure Standard B2ms (2 vCPUs, 8 GiB memory) [Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz]

'pdfminer_text' took: 0.9271 sec
'pyxpdf_text' took: 0.0424 sec

'pdfminer_text_100mb' took: 7.2833 sec
'pyxpdf_text_100mb' took: 0.3301 sec

'pdfminer_text_500mb' took: 36.5288 sec
'pyxpdf_text_500mb' took: 0.9786 sec
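
The compare.py script itself is not reproduced here. As a rough illustration of how timings in the format above could be produced, here is a minimal sketch of a timing decorator. The names `timeit` and `dummy_text` are hypothetical, and the workload is a stand-in rather than a real call into pdfminer.six or pyxpdf.

```python
import time
from functools import wraps


def timeit(func):
    """Print the wall-clock time of each call as "'name' took: N sec"."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # !r adds the single quotes around the function name
        print(f"{func.__name__!r} took: {elapsed:.4f} sec")
        return result
    return wrapper


@timeit
def dummy_text(n):
    # Stand-in workload (assumption); a real comparison would call
    # the text extractor being measured on a PDF file here.
    return " ".join(str(i) for i in range(n))


text = dummy_text(1000)
```

Wrapping each extractor in its own decorated function keeps the measurement code identical for both libraries, so only the extraction call differs between runs.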

Size      pdfminer.six   pyxpdf        times faster
1 MB      0.9271 sec     0.0424 sec    x21
100 MB    7.2833 sec     0.3301 sec    x22
500 MB    36.5288 sec    0.9786 sec    x37

In these tests, pyxpdf is at least 20 times faster than pdfminer.six.
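
The speedup factors can be checked from the measured timings. The sketch below assumes the factors were obtained by dividing the pdfminer.six time by the pyxpdf time and truncating to a whole number; the dictionary layout and the `speedup` helper are illustrative, not part of compare.py.

```python
# Measured times in seconds: (pdfminer.six, pyxpdf), taken from the results above
timings = {
    "1 MB": (0.9271, 0.0424),
    "100 MB": (7.2833, 0.3301),
    "500 MB": (36.5288, 0.9786),
}


def speedup(slow, fast):
    """Return how many times faster `fast` is than `slow`, truncated to an int."""
    return int(slow / fast)


speedups = {size: speedup(slow, fast) for size, (slow, fast) in timings.items()}

for size, factor in speedups.items():
    print(f"{size}: x{factor}")

# The smallest factor backs the "at least 20 times" claim
print("minimum speedup:", min(speedups.values()))
```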