Introduction
While working on an artificial intelligence project for natural language processing, I needed a Python library that could help me extract text from files (in this case, PDF files).
While searching, I noticed that there are lots of options available, like PyPDF2, pdftotext, pdfminer.six…
Happily, I found the textract library, which provides an extremely simple interface for extracting content from a wide range of file types, including images and audio.
Here is the list of file types it supports, along with the underlying library it uses for each.
.csv via python builtins
.doc via antiword
.docx via python-docx2txt
.eml via python builtins
.epub via ebooklib
.gif via tesseract-ocr
.jpg and .jpeg via tesseract-ocr
.json via python builtins
.html and .htm via beautifulsoup4
.mp3 via sox, SpeechRecognition, and pocketsphinx
.msg via msg-extractor
.odt via python builtins
.ogg via sox, SpeechRecognition, and pocketsphinx
.pdf via pdftotext (default) or pdfminer.six
.png via tesseract-ocr
.pptx via python-pptx
.ps via ps2text
.rtf via unrtf
.tiff and .tif via tesseract-ocr
.txt via python builtins
.wav via SpeechRecognition and pocketsphinx
.xlsx via xlrd
.xls via xlrd
This is great, because I found not only a good wrapper but also a list of packages that I can use directly for text extraction, textual analysis, and visualization.
Installing textract on Mac
Open the Terminal and run the commands below. This assumes you have Homebrew installed.
brew cask install xquartz
brew install poppler antiword unrtf tesseract swig sox
pip3 install textract
How to use textract for PDF, MP3, PNG text extraction
There are two ways to use textract: from the CLI, or as a Python package. In this example the CLI will be used. For Python projects, just import textract and call textract.process('path/of/file.extension').
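For the Python route, a minimal sketch might look like the following. The helper name extract_text and its process parameter are my own (hypothetical, not part of textract); the parameter exists only so the helper can be tried without textract installed. It assumes textract.process returns bytes that decode as UTF-8.

```python
from pathlib import Path


def extract_text(path, process=None, **options):
    """Return the text content of `path` as a str.

    `process` defaults to textract.process. It is a hypothetical hook,
    not a textract feature, so the helper can be exercised with a stand-in.
    Extra keyword options are passed straight through to `process`.
    """
    if process is None:
        import textract  # third-party: pip3 install textract
        process = textract.process
    raw = process(str(path), **options)  # textract.process returns bytes
    return raw.decode("utf-8")           # assumption: UTF-8 output


# Usage (requires textract and the tools installed above):
#   print(extract_text("path/of/file.pdf"))
#   print(extract_text("path/of/file.pdf", method="pdfminer"))
```

The second usage line passes method="pdfminer" through to textract.process, which is how textract lets you pick pdfminer.six instead of the default pdftotext for PDF files.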
Example in video:
Conclusion
textract makes text extraction a breeze. I need to run further tests to know whether the underlying libraries it uses are the best options available (not least because "best" is subjective…), but what excites me most is that I can even tune the underlying libraries the way I want.
What do you think?