Introduction
While I was working on an artificial intelligence project for natural language processing, I needed a Python library that could help me extract text from files (in this case, PDF files).
While searching, I noticed that there are lots of options available, such as PyPDF2, pdftotext, and pdfminer.six.
Then, to my delight, I found the textract library, which provides an extremely simple interface for extracting content from a wide range of file types, including images and audio.
Here is the list of file types it supports, and the underlying library it uses for each one.
.csv via python builtins
.doc via antiword
.docx via python-docx2txt
.eml via python builtins
.epub via ebooklib
.gif via tesseract-ocr
.jpg and .jpeg via tesseract-ocr
.json via python builtins
.html and .htm via beautifulsoup4
.mp3 via sox, SpeechRecognition, and pocketsphinx
.msg via msg-extractor
.odt via python builtins
.ogg via sox, SpeechRecognition, and pocketsphinx
.pdf via pdftotext (default) or pdfminer.six
.png via tesseract-ocr
.pptx via python-pptx
.ps via ps2text
.rtf via unrtf
.tiff and .tif via tesseract-ocr
.txt via python builtins
.wav via SpeechRecognition and pocketsphinx
.xlsx via xlrd
.xls via xlrd
This is great news, because I found not only a good wrapper but also a list of packages that I can use directly for text extraction, textual analysis, and visualization.
Installing textract on Mac
Open the Terminal and execute the commands below. Here I am assuming you have Homebrew installed.
brew install --cask xquartz
brew install poppler antiword unrtf tesseract swig sox
pip3 install textract
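Before trying textract, it can help to confirm that the external tools it relies on are actually reachable. The snippet below is only a minimal sketch: the function name missing_tools and the list of binaries are my own assumptions (pdftotext is the binary that ships with poppler), so adjust the list to the formats you care about.

```python
import shutil

# Command-line tools assumed to come from the Homebrew packages above
# (pdftotext ships with poppler). This list is an assumption, not an
# official textract requirement list; trim it to the formats you need.
REQUIRED_TOOLS = ["pdftotext", "antiword", "unrtf", "tesseract", "sox"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools were found.")
```

If a tool is reported as missing, only the file types that depend on it should be affected, so this is a quick way to see which formats are likely to fail.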
How to use textract for PDF, MP3, PNG text extraction
There are two ways to use textract: from the CLI, or as a Python package. In this example the CLI will be used. For Python projects, just import textract and call textract.process('path/of/file.extension').
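For the Python route, a small wrapper could look like the sketch below. The helper name extract_text and the placeholder path are hypothetical; the only textract call used is textract.process, which returns bytes, and the optional method argument (for example "pdfminer" instead of the default pdftotext for PDFs).

```python
from pathlib import Path


def extract_text(path, method=None, encoding="utf-8"):
    """Return the text content of `path` as a str, using textract."""
    # Imported here so this sketch can be loaded even where textract
    # is not installed yet.
    import textract

    # Only pass `method` when the caller asks for a specific backend,
    # e.g. method="pdfminer" for PDFs.
    kwargs = {"method": method} if method else {}

    # textract.process returns bytes, so decode before treating it as text.
    raw = textract.process(str(path), **kwargs)
    return raw.decode(encoding)


if __name__ == "__main__":
    sample = Path("path/of/file.pdf")  # placeholder path, replace with a real file
    if sample.exists():
        print(extract_text(sample)[:500])
```

Keeping the decoding in one place means the rest of a project can work with ordinary strings, whatever the input file type was.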
Example in video:
Conclusion
Textract makes text extraction a breeze. I need further tests to know for sure whether the underlying libraries it uses are the best options available (not least because "best" is subjective), but what excites me most is that I can even tune the underlying libraries the way I want.



What do you think?