How to Extract Text from PDF, DOC, DOCX, MP3, WAV, JPG, PNG and Other Files Using textract

By Fernando Rodrigues, February 20, 2018 (March 8, 2018). In Artificial Intelligence, Machine Learning, Natural Language Processing, Python. Tagged: textract

Introduction

While working on an artificial intelligence project involving natural language processing, I needed a Python library that could help me extract text from files (in this case, PDF files).

While searching, I noticed that there are lots of options available, such as PyPDF2, pdftotext and pdfminer.six.

Happily, I found the textract library, which provides an extremely simple interface for extracting content from many file types, including images and audio.

Here is the list of file types it supports, and the underlying library it uses for each.

.csv via python builtins
.doc via antiword
.docx via python-docx2txt
.eml via python builtins
.epub via ebooklib
.gif via tesseract-ocr
.jpg and .jpeg via tesseract-ocr
.json via python builtins
.html and .htm via beautifulsoup4
.mp3 via sox, SpeechRecognition, and pocketsphinx
.msg via msg-extractor
.odt via python builtins
.ogg via sox, SpeechRecognition, and pocketsphinx
.pdf via pdftotext (default) or pdfminer.six
.png via tesseract-ocr
.pptx via python-pptx
.ps via ps2text
.rtf via unrtf
.tiff and .tif via tesseract-ocr
.txt via python builtins
.wav via SpeechRecognition and pocketsphinx
.xlsx via xlrd
.xls via xlrd

This is great, because I found not only a good wrapper but also a list of packages that I can use for text extraction, textual analysis and visualization.
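
As a small illustration, the list above can be captured as a lookup table, for example to check whether a file is worth passing to textract at all. This is only a sketch: the BACKENDS dictionary and the backend_for helper are hypothetical names made up for this post, not part of textract, and the table simply copies the list above rather than reading anything from the library.

```python
from pathlib import Path

# Copied by hand from the list above; not queried from textract itself.
BACKENDS = {
    ".csv": "python builtins",
    ".doc": "antiword",
    ".docx": "python-docx2txt",
    ".eml": "python builtins",
    ".epub": "ebooklib",
    ".gif": "tesseract-ocr",
    ".jpg": "tesseract-ocr",
    ".jpeg": "tesseract-ocr",
    ".json": "python builtins",
    ".html": "beautifulsoup4",
    ".htm": "beautifulsoup4",
    ".mp3": "sox, SpeechRecognition, and pocketsphinx",
    ".msg": "msg-extractor",
    ".odt": "python builtins",
    ".ogg": "sox, SpeechRecognition, and pocketsphinx",
    ".pdf": "pdftotext (default) or pdfminer.six",
    ".png": "tesseract-ocr",
    ".pptx": "python-pptx",
    ".ps": "ps2text",
    ".rtf": "unrtf",
    ".tiff": "tesseract-ocr",
    ".tif": "tesseract-ocr",
    ".txt": "python builtins",
    ".wav": "SpeechRecognition and pocketsphinx",
    ".xlsx": "xlrd",
    ".xls": "xlrd",
}


def backend_for(path):
    """Return the backend listed for this file's extension, or None."""
    # Compare case-insensitively so "scan.PNG" matches ".png".
    return BACKENDS.get(Path(path).suffix.lower())


print(backend_for("scan.PNG"))     # tesseract-ocr
print(backend_for("archive.zip"))  # None
```

A None result only means the extension is not in the list above; it says nothing about whether the required system packages are installed.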

Installing textract on Mac

Open the Terminal and run the commands below. I am assuming you have Homebrew installed.

brew install --cask xquartz
brew install poppler antiword unrtf tesseract swig sox
pip3 install textract

How to use textract for PDF, MP3, PNG text extraction

There are two ways to use textract: from the CLI or as a Python package. In this example the CLI will be used. For Python projects, just import textract and call textract.process('path/of/file.extension').
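
For the Python route, a minimal sketch could look like the one below. It assumes textract and its system dependencies are already installed. The extract_text helper is a hypothetical name, and the paths in the usage comments are placeholders. textract.process returns bytes, so the result is decoded here, assuming textract's default UTF-8 output encoding. The optional method argument is passed straight through to textract, for example method="pdfminer" to use pdfminer.six instead of pdftotext for a PDF.

```python
from pathlib import Path


def extract_text(path, method=None):
    """Extract text from a file with textract and return it as a str."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")

    # Imported here so this module still loads where textract is not installed.
    import textract

    # Only pass "method" when the caller asked for a specific backend.
    kwargs = {"method": method} if method else {}
    raw = textract.process(str(path), **kwargs)  # bytes

    # Assumes the default output encoding (UTF-8) was not overridden.
    return raw.decode("utf-8")


# Usage (placeholder paths):
# text = extract_text("path/of/file.pdf")
# text = extract_text("path/of/file.pdf", method="pdfminer")
```

Checking for the file up front is a design choice of this sketch, so that a wrong path fails with a familiar Python exception before textract is involved.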

Example in video:

Conclusion

textract makes text extraction a breeze. I need further tests to know for sure whether the underlying libraries it uses are the best options available (not least because "best" is subjective), but what excites me most is that I can even tune the underlying libraries the way I want.
