Speed Comparison

Because pyxpdf builds on the xpdf reader sources and is written in Cython as a Python C-API module, it is much faster than pure-Python PDF parsers.

Text Extraction

This benchmark compares the speed of layout-preserving text extraction in pyxpdf against the popular pdfminer.six module. (Python script used: compare.py)
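The timing lines reported below have the form `'name' took: N sec`, which a small timing decorator can produce. The following is a minimal sketch of such a harness, not the actual contents of compare.py; the names `timed` and `dummy_extract` are hypothetical, and the stand-in workload would be replaced by the real pdfminer.six and pyxpdf extraction calls.

```python
import time
from functools import wraps


def timed(func):
    """Print how long each call to func takes, then return its result."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # Same shape as the report lines: 'function_name' took: 0.1234 sec
        print(f"{func.__name__!r} took: {elapsed:.4f} sec")
        return result
    return wrapper


@timed
def dummy_extract(n_pages):
    # Hypothetical stand-in for a real text extraction call.
    return "\n".join(f"page {i}" for i in range(n_pages))


text = dummy_extract(1000)
```

Using `time.perf_counter()` rather than `time.time()` gives a monotonic, high-resolution clock, which matters for the sub-second pyxpdf runs.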

The benchmarks were run with Python 3.6.9 and gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 on Ubuntu 18.04, on an Azure Standard B2ms instance (2 vCPUs, 8 GiB memory) [Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz].

'pdfminer_text' took: 0.9271 sec
'pyxpdf_text' took: 0.0424 sec

'pdfminer_text_100mb' took: 7.2833 sec
'pyxpdf_text_100mb' took: 0.3301 sec

'pdfminer_text_500mb' took: 36.5288 sec
'pyxpdf_text_500mb' took: 0.9786 sec

Size	pdfminer.six	pyxpdf	times faster
1 MB	0.9271 sec	0.0424 sec	x21
100 MB	7.2833 sec	0.3301 sec	x22
500 MB	36.5288 sec	0.9786 sec	x37
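The "times faster" column can be recomputed from the measured timings as the ratio of pdfminer.six time to pyxpdf time. The sketch below uses only the numbers reported above; the variable names are illustrative. The table's values appear to be truncated rather than rounded, since the 1 MB ratio is closer to 22 than 21.

```python
# (pdfminer.six seconds, pyxpdf seconds) per input size, from the results above.
timings = {
    "1 MB": (0.9271, 0.0424),
    "100 MB": (7.2833, 0.3301),
    "500 MB": (36.5288, 0.9786),
}

# Speedup = how many times longer pdfminer.six took than pyxpdf.
speedups = {
    size: pdfminer_sec / pyxpdf_sec
    for size, (pdfminer_sec, pyxpdf_sec) in timings.items()
}

for size, ratio in speedups.items():
    print(f"{size}: x{ratio:.1f}")
```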

In these tests, pyxpdf is at least 20 times faster than pdfminer.six.