Speed Comparison
Thanks to the brilliant xpdf reader sources and the fact that pyxpdf is written in Cython as a Python C-API module, it is much faster than pure-Python PDF parsers.
Text Extraction
This benchmark compares the speed of layout-preserving text extraction against the popular pdfminer.six module (Python script used: compare.py).
The benchmark was run with Python 3.6.9 and gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 on Ubuntu 18.04, on an Azure Standard B2ms instance (2 vCPUs, 8 GiB memory) [Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz]:
'pdfminer_text' took: 0.9271 sec
'pyxpdf_text' took: 0.0424 sec
'pdfminer_text_100mb' took: 7.2833 sec
'pyxpdf_text_100mb' took: 0.3301 sec
'pdfminer_text_500mb' took: 36.5288 sec
'pyxpdf_text_500mb' took: 0.9786 sec
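Output lines in this format could come from a small timing decorator. The following is a minimal sketch, not the actual compare.py: the `timeit` decorator and the `dummy_text` stand-in workload are hypothetical names introduced for illustration, and a real comparison would wrap the pdfminer.six and pyxpdf extraction calls instead.

```python
import time
from functools import wraps


def timeit(func):
    """Print the wall-clock time of each call as "'name' took: X sec"."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # !r adds the single quotes around the function name
        print(f"{func.__name__!r} took: {elapsed:.4f} sec")
        return result
    return wrapper


@timeit
def dummy_text(data):
    # Hypothetical stand-in workload; replace with a real extraction call.
    return data.upper()


dummy_text("hello pdf")
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock intended for measuring short durations.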
Size | pdfminer.six | pyxpdf | times faster |
---|---|---|---|
1 MB | 0.9271 sec | 0.0424 sec | x21 |
100 MB | 7.2833 sec | 0.3301 sec | x22 |
500 MB | 36.5288 sec | 0.9786 sec | x37 |
In these runs, pyxpdf is at least 20 times faster than pdfminer.six.
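The "times faster" factors are the ratio of the two timings, truncated to a whole number. A short sketch that recomputes them from the reported figures (the `timings` dictionary and `speedup` helper are illustrative names, not part of pyxpdf):

```python
# Timings in seconds, copied from the table above:
# size -> (pdfminer.six, pyxpdf)
timings = {
    "1 MB": (0.9271, 0.0424),
    "100 MB": (7.2833, 0.3301),
    "500 MB": (36.5288, 0.9786),
}


def speedup(slow, fast):
    """Return how many times faster `fast` is than `slow`, truncated to an int."""
    return int(slow / fast)


speedups = {size: speedup(slow, fast) for size, (slow, fast) in timings.items()}

for size, factor in speedups.items():
    print(f"{size}: x{factor}")
# 1 MB: x21
# 100 MB: x22
# 500 MB: x37
```

The smallest factor across the three file sizes is 21, which is where the "at least 20 times" figure comes from.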