Convert PDF to text in Ubuntu

Published: July 24, 2022

In this article, we cover how to convert a PDF to text in Ubuntu. PDF stands for Portable Document Format. A PDF file can contain text, images and multimedia objects. Note that the method covered here extracts only the text itself: if the PDF file contains images or multimedia objects that have text in them, that embedded text won't be converted.

pdftotext is the command-line utility used to extract text from PDFs. It is available through the poppler-utils package, so we will first cover the installation steps for that package.

Note: The following operations require superuser privileges. If you don't have them, we advise you to contact your System Administrator for assistance.

Install poppler-utils in Ubuntu

The package is already available through the standard Ubuntu repository. First, update the package index so that we get the latest available version of the package. Open a terminal and issue the following -

sudo apt update

Next, install the poppler-utils package -

sudo apt install poppler-utils

We can now use the pdftotext command-line utility.
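
Before moving on, it can help to confirm that the installation actually put pdftotext on the PATH. The following is a minimal sketch of such a check; the helper name check_tool is our own choice for illustration and is not part of poppler-utils.

```shell
#!/bin/sh
# check_tool is a hypothetical helper name, not part of poppler-utils.
# It reports whether the given command can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 is installed"
    else
        echo "$1 is missing" >&2
        return 1
    fi
}

# Check for pdftotext; "|| true" keeps the script going if it is absent.
check_tool pdftotext || true
```

If the tool is reported as missing, re-running the install command above is the first thing to try.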

Convert PDF to text in Ubuntu

The following is the syntax for the pdftotext utility -

pdftotext [options] [PDF_file] [text_file]

Let's say we have a PDF file, test.pdf, and want the resulting text file to be out.txt.

To convert all the pages of the PDF file to a text file, use the following command -

pdftotext test.pdf out.txt
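
As mentioned at the start, text that exists only inside images is not extracted, so a scanned PDF can produce an output file with no readable text at all. A rough way to detect that is to look for any non-whitespace character in the result. This is only a sketch: the helper name has_text is ours, and it assumes the output file is named out.txt as in the example above.

```shell
#!/bin/sh
# has_text is a hypothetical helper, not part of poppler-utils.
# It succeeds if the given file contains at least one non-whitespace
# character. Form feeds, which can appear between pages in the output,
# count as whitespace here.
has_text() {
    grep -q '[^[:space:]]' "$1"
}

# Example usage, guarded so the script is harmless when out.txt is absent.
if [ -f out.txt ]; then
    if has_text out.txt; then
        echo "out.txt contains extracted text"
    else
        echo "out.txt is empty; the PDF may consist of scanned images"
    fi
fi
```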

We can also specify the first page to convert through the -f option -

pdftotext -f 4 test.pdf out.txt

It would convert the PDF from the fourth page through the last page.

Similarly, we can specify the last page to convert through the -l option -

pdftotext -l 3 test.pdf out.txt

It would convert the first three pages.

Furthermore, we can also combine both options -

pdftotext -f 2 -l 5 test.pdf out.txt

It will convert pages 2 through 5, inclusive.
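
If page ranges are needed often, the -f and -l options can be wrapped in a small shell function that rejects an obviously wrong range before calling pdftotext. The following is a sketch under our own assumptions: the function name extract_pages and its argument order are hypothetical, and only the -f and -l options shown above are used.

```shell
#!/bin/sh
# extract_pages is a hypothetical wrapper, not part of poppler-utils.
# Usage: extract_pages FIRST LAST input.pdf output.txt
extract_pages() {
    first=$1
    last=$2
    pdf=$3
    txt=$4

    # Reject anything that is not a whole number.
    for n in "$first" "$last"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "page numbers must be whole numbers" >&2
                return 2
                ;;
        esac
    done

    # Pages are counted from 1, and the range must not be reversed.
    if [ "$first" -lt 1 ] || [ "$first" -gt "$last" ]; then
        echo "need 1 <= FIRST <= LAST" >&2
        return 2
    fi

    pdftotext -f "$first" -l "$last" "$pdf" "$txt"
}

# Example call, equivalent to the combined command above:
#   extract_pages 2 5 test.pdf out.txt
```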

Apart from that, we can also set the encoding of the output text file (the default is UTF-8). This can be done through the -enc option -

pdftotext -enc <encoding_name> test.pdf out.txt
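
To make the placeholder concrete, here is a small sketch that passes an encoding name to -enc and falls back to UTF-8 when none is given. The function name convert_with_encoding is hypothetical, and Latin1 is assumed to be among the encoding names your pdftotext build accepts; on poppler builds that provide it, running pdftotext -listenc should print the available names.

```shell
#!/bin/sh
# convert_with_encoding is a hypothetical wrapper, not part of poppler-utils.
# Usage: convert_with_encoding input.pdf output.txt [ENCODING]
# ENCODING falls back to UTF-8, the pdftotext default, when omitted.
convert_with_encoding() {
    pdf=$1
    txt=$2
    enc=${3:-UTF-8}
    pdftotext -enc "$enc" "$pdf" "$txt"
}

# Example calls (Latin1 is assumed to be a supported encoding name):
#   convert_with_encoding test.pdf out.txt
#   convert_with_encoding test.pdf out-latin1.txt Latin1
```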

In conclusion, we have covered how to install poppler-utils and use pdftotext to convert PDF to text in Ubuntu.