Convert PDF to HTML in Ubuntu

In this article, we discuss how to convert a PDF to HTML in Ubuntu using the pdftohtml command-line utility. PDF (Portable Document Format) was developed by Adobe and released in 1993.

First, we have to install the poppler-utils package, which is available through Ubuntu’s standard repository. This package provides the pdftohtml command-line utility.

Note: The following installation steps require superuser privileges. In case you don’t have them, we advise you to contact your System Administrator for assistance.

Install poppler-utils in Ubuntu

First, update the package index so that the latest version of the package is available. Issue the following in a terminal –

sudo apt update

Next, install the poppler-utils package –

sudo apt install poppler-utils

To verify the installation –

pdftohtml -v
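When the conversion is run from a script, it can help to confirm that the binary is on the PATH before calling it. The following is only a minimal sketch. The function name have_cmd is our own and is not part of poppler-utils.

```shell
#!/bin/sh
# have_cmd is a hypothetical helper: it succeeds if the named command is on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd pdftohtml; then
    echo "pdftohtml is installed"
else
    echo "pdftohtml not found; install the poppler-utils package"
fi
```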

Now, let's say we have a PDF (example.pdf, containing 25 pages). From here on, we illustrate various pdftohtml options using example.pdf.

Convert PDF to HTML in Ubuntu

A. First, we will see the basic command to convert example.pdf to f.html. Open a terminal in the directory where example.pdf is stored.

pdftohtml [options] <pdf-file> [<html-file>]
pdftohtml example.pdf f.html

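This will create an HTML file, f.html, in the current directory. By default pdftohtml generates framed output, so f.html is accompanied by additional files that it references.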
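Because the default mode can write more than one file, it may be tidier to send the output to a directory of its own. Below is a rough sketch of such a wrapper. The function name convert_to_dir and the directory name html-out are our own inventions, and we assume pdftohtml accepts an output path that includes a directory.

```shell
#!/bin/sh
# convert_to_dir is a hypothetical wrapper, not part of poppler-utils.
# Usage: convert_to_dir <pdf-file> <output-dir>
convert_to_dir() {
    pdf=$1
    outdir=$2
    # Refuse to continue if the input PDF does not exist.
    if [ ! -f "$pdf" ]; then
        echo "no such file: $pdf" >&2
        return 1
    fi
    mkdir -p "$outdir" || return 1
    # Derive the output base name from the PDF name (example.pdf -> example).
    base=$(basename "$pdf" .pdf)
    # Assumption: pdftohtml accepts an output path with a directory component.
    pdftohtml "$pdf" "$outdir/$base.html" || return 1
    # Show what was generated.
    ls -1 "$outdir"
}

# Example call, assuming example.pdf is in the current directory:
# convert_to_dir example.pdf html-out
```

The input check runs before the directory is created, so a mistyped file name does not leave an empty directory behind.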

B. A drawback of the above method is that it doesn’t retain text formatting. To retain the formatting, use the following command –

pdftohtml -c example.pdf f.html

where,

-c generates a complex HTML document, which retains the text formatting.

C. There is one more way to retain text formatting. In this case, the HTML output is generated as a single document.

pdftohtml -s example.pdf f.html

Here, a new HTML file, f-html.html, will be generated.

where,

-s generates a single HTML file that contains all the pages.

D. We can also select the range of pages to convert to HTML –

pdftohtml -f <int> -l <int> [options] <pdf-file> [<html-file>]

where,

-f <int> is the first page to convert,

-l <int> is the last page to convert.

So, to convert pages 4 to 10 –

pdftohtml -f 4 -l 10 example.pdf f.html

If we want to convert example.pdf from a specific page to the end of the file, we can omit the -l option altogether. For instance, the following converts example.pdf from page 6 to the end of the file.

pdftohtml -f 6 example.pdf f.html
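When the page numbers come from user input or another script, it may be worth validating them before they reach pdftohtml. The sketch below builds the -f and -l options from one or two numbers. The function name range_args is our own and the checks are only illustrative.

```shell
#!/bin/sh
# range_args is a hypothetical helper that prints the -f/-l options for pdftohtml.
# Usage: range_args <first-page> [<last-page>]
range_args() {
    first=$1
    last=${2:-}
    # The first page is mandatory and must consist of digits only.
    case $first in
        ''|*[!0-9]*) echo "first page must be a whole number" >&2; return 1 ;;
    esac
    # Without a last page, convert from the first page to the end of the file.
    if [ -z "$last" ]; then
        echo "-f $first"
        return 0
    fi
    case $last in
        *[!0-9]*) echo "last page must be a whole number" >&2; return 1 ;;
    esac
    if [ "$first" -gt "$last" ]; then
        echo "first page ($first) is after last page ($last)" >&2
        return 1
    fi
    echo "-f $first -l $last"
}

# Example, assuming example.pdf is in the current directory.
# The unquoted expansion is intentional so that the options split into words:
# pdftohtml $(range_args 4 10) example.pdf f.html
```

Calling range_args with a single number mirrors the case where -l is omitted.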

In conclusion, we have discussed how to convert a PDF to HTML in Ubuntu.

Additional Info –

Besides, we can also convert our PDF to XML (Extensible Markup Language).

pdftohtml -xml <pdf-file> [<xml-file>]

Continuing with our example.pdf –

pdftohtml -xml example.pdf x.xml
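As a quick sanity check on the XML output, we could count how many pages it describes. The sketch below assumes that each page appears as a <page ...> element on its own line, which is how we understand the output to be laid out. The function name count_xml_pages is our own.

```shell
#!/bin/sh
# count_xml_pages is a hypothetical helper, not part of poppler-utils.
# It counts the lines containing "<page " in the given XML file,
# assuming each page is written as a <page ...> element on its own line.
count_xml_pages() {
    if [ ! -f "$1" ]; then
        echo "no such file: $1" >&2
        return 1
    fi
    grep -c '<page ' "$1"
}

# Example, assuming x.xml was produced by the command above:
# count_xml_pages x.xml
```

If the assumption holds, the count for the full example.pdf would be expected to match its 25 pages.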