
How to Extract Text from PDF, DOC, DOCX, MP3, WAV, JPG, PNG and Other Files Using textract

 

Introduction

 

While working on an artificial intelligence project involving natural language processing, I needed a Python library that could help me extract text (in this case, from PDF files).

While searching, I noticed that there are many options available, such as PyPDF2, pdftotext and pdfminer.six.

Happily, I found the textract library, which provides an extremely simple interface for extracting content from many file types, including images and audio.

Here is the list of files it supports, and the underlying library it uses.

 

 

This is great news, because I found not only a good wrapper but also a list of packages that I can use for text extraction, textual analysis and visualization.

 

Installing textract on Mac

Open the Terminal and run the commands below. I am assuming you have Homebrew installed.

brew install --cask xquartz
brew install poppler antiword unrtf tesseract swig sox
pip3 install textract
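To confirm that the Homebrew step worked before running textract, a small check like the one below can help. It is only a sketch: the mapping from each Homebrew package to a command name (for example, `pdftotext` for poppler) is my assumption, and `missing_tools` is a hypothetical helper, not part of textract.

```python
import shutil

# Assumed mapping: Homebrew package -> a command that package installs.
REQUIRED_TOOLS = {
    "poppler": "pdftotext",
    "antiword": "antiword",
    "unrtf": "unrtf",
    "tesseract": "tesseract",
    "sox": "sox",
    "swig": "swig",
}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the sorted package names whose command is not found on PATH."""
    return sorted(pkg for pkg, cmd in tools.items() if which(cmd) is None)


# Report anything that still needs to be installed.
print("Missing Homebrew packages:", missing_tools() or "none")
```

If the list is not empty, running `brew install` again for the packages it names should fix it.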

 

How to use textract for PDF, MP3, PNG text extraction

There are two ways to use textract: from the CLI or as a Python package. This example uses the CLI. For Python projects, just import textract and call textract.process('path/of/file.extension').

Example in video:
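For the Python route, here is a minimal sketch. The wrapper `extract_text` and its `extractor` parameter are hypothetical names I made up for illustration; only `textract.process` comes from the library. It returns bytes, which the wrapper decodes assuming the default UTF-8 output.

```python
from typing import Callable, Optional


def extract_text(path: str,
                 extractor: Optional[Callable[..., bytes]] = None,
                 **options) -> str:
    """Return the text content of `path` as a stripped str.

    `extractor` is a hypothetical hook that lets you swap in another
    callable; by default the real `textract.process` is used.
    """
    if extractor is None:
        # Imported lazily so this module loads even without textract installed.
        import textract
        extractor = textract.process
    raw = extractor(path, **options)  # textract.process returns bytes
    return raw.decode("utf-8").strip()


# Example calls (paths are placeholders):
#   extract_text("path/of/file.pdf")
#   extract_text("path/of/file.pdf", method="pdfminer")
```

Extra keyword arguments are passed straight to `textract.process`, so an option such as `method="pdfminer"` selects a different underlying parser for PDF files.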

 

 

Conclusion

Textract makes text extraction a breeze. I need to run further tests to know for sure whether the underlying libraries it uses are the best options available (also because "best" is subjective), but what excites me most is that I can tune the underlying libraries the way I want.
