In this article, we discuss how to convert a PDF to HTML in Ubuntu using the pdftohtml command-line utility. PDF (Portable Document Format) was introduced by Adobe in 1993.
First, we have to install the poppler-utils package. It is available through Ubuntu’s standard repository and provides the pdftohtml command-line utility.
Note: The following installation steps require superuser privileges. If you don’t have them, contact your system administrator for assistance.
Install poppler-utils in Ubuntu
First, update the package index so that the latest version of the package is available. Issue the following in a terminal:
sudo apt update
Next, install the poppler-utils package:
sudo apt install poppler-utils
To verify the installation, print the version information:
pdftohtml -v
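If the conversion is going to run from a script, it may help to confirm that the tool is on the PATH before calling it. The following is a minimal sketch; `require_tool` is a hypothetical helper name introduced here for illustration and is not something poppler-utils provides.

```shell
# require_tool NAME: succeed if NAME is on the PATH, otherwise print a hint.
# (Hypothetical helper for illustration; not part of poppler-utils.)
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "$1 not found; try: sudo apt install poppler-utils" >&2
    return 1
}
```

It could be used as `require_tool pdftohtml && pdftohtml -v`.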
Now, let's say we have a PDF, example.pdf, containing 25 pages. From here on, we illustrate various pdftohtml options using example.pdf.
Convert PDF to HTML in Ubuntu
A. First, we will see the basic command for converting example.pdf to f.html. Open a terminal in the directory where example.pdf is stored. The general syntax, followed by the command for our example, is:
pdftohtml [options] <pdf-file> [<html-file>]
pdftohtml example.pdf f.html
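This creates an HTML file, f.html, in the current directory. Because pdftohtml generates a framed document by default, it typically also writes companion files (f_ind.html and fs.html) next to it.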
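The same basic command can be applied to a whole directory of PDFs with a small loop. This is only a sketch: `convert_all` is a hypothetical helper name, and naming each output after its source file is an assumption made for illustration.

```shell
# convert_all DIR: run the basic conversion on every .pdf file in DIR.
# (Hypothetical helper for illustration.)
convert_all() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        # If nothing matches, the unexpanded pattern is left as-is; skip it.
        [ -e "$pdf" ] || continue
        # example.pdf -> example.html, written next to the source file
        pdftohtml "$pdf" "${pdf%.pdf}.html"
    done
}
```

Calling `convert_all .` would process the current directory.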
B. A drawback of the above method is that it doesn’t retain text formatting. To retain text formatting, use the following command:
pdftohtml -c example.pdf f.html
where,
-c generates complex output, which aims to preserve the formatting and layout of the original pages.
C. There is one more way to retain text formatting. In this case, the HTML is generated as a single document.
pdftohtml -s example.pdf f.html
Here, a new HTML file, f-html.html, is generated. It contains all the pages.
where,
-s generates a single HTML file that contains all the pages.
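Since the -s output does not land in the file named on the command line, a script may want to work out the resulting name before using it. The sketch below follows the f.html to f-html.html naming described above; both function names are hypothetical, and the naming pattern should be checked against the installed pdftohtml version.

```shell
# single_output_name HTML_FILE: print the file name that -s is expected to
# produce, following the f.html -> f-html.html pattern from this article.
# (Hypothetical helper; verify the pattern with your pdftohtml version.)
single_output_name() {
    printf '%s-html.html\n' "${1%.html}"
}

# convert_single PDF HTML_FILE: convert with -s and report the result.
convert_single() {
    pdftohtml -s "$1" "$2" || return 1
    out=$(single_output_name "$2")
    if [ -f "$out" ]; then
        echo "wrote $out"
    else
        echo "expected $out but did not find it" >&2
        return 1
    fi
}
```

For the running example this would be called as `convert_single example.pdf f.html`.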
D. We can also select the range of pages to convert to HTML:
pdftohtml -f <int> -l <int> [options] <pdf-file> [<html-file>]
where,
-f <int> is the first page to convert,
-l <int> is the last page to convert.
So, to convert pages 4 through 10:
pdftohtml -f 4 -l 10 example.pdf f.html
If we want to convert example.pdf from a specific page to the end of the file, we can skip the -l option altogether. For instance, the following converts example.pdf from page 6 to the end of the file.
pdftohtml -f 6 example.pdf f.html
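The -f and -l options can also be computed rather than typed by hand, for example to convert only the last few pages. The sketch below relies on pdfinfo, which ships in the same poppler-utils package, and assumes its output contains a `Pages:` line. The function names `page_count` and `convert_last_n` are hypothetical helpers.

```shell
# page_count PDF: print the number of pages reported by pdfinfo.
page_count() {
    pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}

# convert_last_n PDF N HTML_FILE: convert only the last N pages.
# (Both function names are hypothetical helpers for illustration.)
convert_last_n() {
    total=$(page_count "$1")
    # Stop if pdfinfo did not report a page count.
    [ -n "$total" ] || return 1
    first=$((total - $2 + 1))
    # Clamp to page 1 when N is larger than the document.
    [ "$first" -ge 1 ] || first=1
    pdftohtml -f "$first" -l "$total" "$1" "$3"
}
```

With the 25-page example.pdf, `convert_last_n example.pdf 5 f.html` would run pdftohtml with -f 21 -l 25.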
In conclusion, we have discussed how to convert a PDF to HTML in Ubuntu.
Additional Info
Besides, we can also convert our PDF to XML (Extensible Markup Language):
pdftohtml -xml <pdf-file> [<xml-file>]
Continuing with our example.pdf:
pdftohtml -xml example.pdf x.xml
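A quick sanity check on the XML output is to count the pages it contains. The sketch below assumes that the generated file wraps each page in a `<page ...>` element whose opening tag sits on its own line. That is an assumption about the output layout, so it is worth inspecting x.xml before relying on it. `count_xml_pages` is a hypothetical helper name.

```shell
# count_xml_pages XML_FILE: count the lines that open a <page ...> element.
# (Hypothetical helper; assumes one <page ...> opening tag per line.)
count_xml_pages() {
    grep -c '<page ' "$1"
}
```

If the assumption holds, `count_xml_pages x.xml` should report the same number of pages as the source PDF.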