View PDFs with a browser using pdftohtml

Author: Rob Reilly

I picked up the phone, and the voice at the other end said, “Rob, can you help me with a PDF file? I can’t read it.” It was my brother, a Beverly Hills attorney, who had received a 1MB PDF file via his Web-based email client and didn’t know how to download it and read it, especially since he had no Acrobat Reader installed. After some mutual grumbling about Acrobat being an unnecessary step in the document-reading process, I discovered a fairly painless solution to the whole mess, thanks to a little-known program called pdftohtml.

pdftohtml converts a regular .pdf file into HTML and, when asked, into a series of Portable Network Graphics (.png) page images with matching HTML files, then packages the whole thing up into a convenient main HTML file. You can view the main file with any Web browser and click on the links to the individual pages.

To begin using pdftohtml, the first thing you need to do is download the tar.gz file and install it. The latest version, as of this writing, is 0.36. I’m running SUSE Linux 8.2 Professional on my laptop, so I loaded pdftohtml version 0.35-22 from the SUSE CDs with YaST2.

Putting pdftohtml to work

pdftohtml runs from the command line with various options. The basic form of the command is:

     rreilly> pdftohtml [pdf file name]

This command gives you a simple HTML file suitable for reading or copying the textual content of the PDF file. You can actually grab the text from your browser and paste it into other applications. It doesn’t produce any PNG files, so you won’t be able to see any embedded graphics. It’s a great utility if you just want to extract the text from an Adobe file.

If you want to see graphics, you’ll need to use the -c (as in “complex”) option:

     rreilly> pdftohtml -c [pdf file name]

This option produces individual HTML files, one for each page of the PDF file, with the PNG references mixed in. The graphics in the original PDF file show up in a browser and the text part can be cut and pasted. The total size of the HTML and PNG files generated with the -c option tends to be roughly equivalent to that of the original PDF.

I’ve used pdftohtml on a variety of files, including OpenOffice.org Impress-generated PDFs. The output was of good quality and even rendered in color.

There are a few limitations that you should be aware of.

  • If the quality of the text and graphics in the PDF is poor, as in a scanned page, the resulting Web pages may be unusable. One way around the poor quality is to load the generated PNG file into the GIMP and boost the contrast. You might also try some of the built-in enhancing filters.
  • Another limitation is that the pdftohtml converter, by default, zooms the resultant PNG files by 1.5 times. That meant that my converted Impress slides were too big for the viewable screen in my browser; I had to scroll left and right or up and down to view the whole slide, and that was with Mozilla and a 1024×768 resolution setting! It’s an easy fix — just set zoom to 1.0 by modifying the command:

         rreilly> pdftohtml -c -zoom 1.0 [pdf file name]

  • You might notice some minor format differences from the original, such as the line spacing being a little off or the font reverting to some default type. Mozilla (version 1.6) seemed to render the generated HTML and PNGs very well. I looked at the same Web pages with Konqueror 3.3.1 and they rendered noticeably worse. Perhaps a later version would work better.
  • pdftohtml can take some time to complete, especially if the PDF file has a lot of pages and you use the -c option. On my old 300MHz laptop it took about five minutes to render a 60-page PDF document into HTML with PNG graphics.
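
The zoom fix and the conversion time mentioned above can be wrapped in a small shell function. This is only a sketch: the function name convert_pdf is made up for illustration, the default zoom of 1.0 is a choice rather than anything pdftohtml requires, and date +%s assumes a GNU-style date command such as the one found on Linux.

```shell
# Hypothetical helper: convert one PDF in "complex" mode at a chosen
# zoom factor and report roughly how long the conversion took.
convert_pdf() {
    pdf=$1
    zoom=${2:-1.0}   # default to 1.0 instead of pdftohtml's built-in 1.5

    # Bail out early if pdftohtml is not installed.
    if ! command -v pdftohtml >/dev/null 2>&1; then
        echo "pdftohtml not found in PATH" >&2
        return 127
    fi

    start=$(date +%s)
    pdftohtml -c -zoom "$zoom" "$pdf" || return 1
    end=$(date +%s)

    echo "converted $pdf in $((end - start)) seconds"
}
```

Calling convert_pdf slides.pdf converts at zoom 1.0, while convert_pdf slides.pdf 1.5 keeps the larger default size.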

There are several other command-line options that you can try, such as -noframes and -nomerge. Type pdftohtml without an argument for a list of options.

Upload converted pages to your Web server

Now that we’ve created the HTML and PNGs from our PDF file, the last step is to put them on a server so others can view them.

If you are going to send your files to your ISP’s Web server, you can just use gftp or another FTP program to upload your HTML and PNG files.

I used scp to copy the files from my laptop to my Web-facing Apache server. It’s a little more work, but I enjoy typing. Right…
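
For readers who also prefer typing, the scp step might look something like the sketch below. The function name upload_doc is hypothetical, and the user, host, and path in the usage note are placeholders, not the actual server used here.

```shell
# Hypothetical helper: copy one conversion directory to a Web server.
upload_doc() {
    doc_dir=$1   # local directory holding the converted HTML and PNG files
    dest=$2      # scp destination in the form user@host:/path/

    if [ ! -d "$doc_dir" ]; then
        echo "no such directory: $doc_dir" >&2
        return 1
    fi

    # -r copies the directory and everything in it;
    # -p preserves modification times and modes.
    scp -r -p "$doc_dir" "$dest"
}
```

A call such as upload_doc brief rreilly@webhost:/srv/www/docs/ would copy a local directory named brief into that placeholder location on the server.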

I also created a directory on my server for each document, named after the original PDF, because I thought I would have several other documents to convert. pdftohtml names the converted files after the base name of the original PDF file, that is, the part of the file name to the left of the .pdf extension. Although you can dump all your converted files into one directory, that can get confusing very quickly. They’re easier to manage if each document has its own conversion directory. To make it really convenient, you could also write up a quick HTML menu document and give its URL to your users.
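
The quick HTML menu could even be generated rather than written by hand. The sketch below assumes one conversion directory per document under a single root, and assumes each directory's main file is named after the directory (brief/brief.html, for example). The function name make_menu and the index.html output name are invented for illustration, and directory names are not HTML-escaped, so it suits simple names only.

```shell
# Hypothetical helper: write index.html in the given root directory,
# with one link per conversion subdirectory.
make_menu() {
    root=$1
    {
        echo '<html><body><h1>Converted documents</h1><ul>'
        for dir in "$root"/*/; do
            # Skip the unexpanded pattern when there are no subdirectories.
            [ -d "$dir" ] || continue
            name=$(basename "$dir")
            # Assumes the main file is <name>/<name>.html.
            echo "<li><a href=\"$name/$name.html\">$name</a></li>"
        done
        echo '</ul></body></html>'
    } > "$root/index.html"
}
```

Running make_menu against the directory that holds the conversion directories produces an index.html whose URL can be handed to users.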

Problem solved

It really wasn’t that much work to get my brother’s PDF file into a form that he could read. It’s a handy little process that might help you out some time. To summarize:

  • He didn’t have to worry about downloading any files. I just emailed him the URL to the server.
  • He wouldn’t go over the space limit on his mail account.
  • Since each page was HTML and a PNG, it loaded pretty quickly into his browser, and he could jump around from page to page all he wanted.
  • Lastly, he could keep his Windows XP box pure and free of the Acrobat reader.

Rob Reilly is a professional technology writer and consultant whose articles appear in various Linux media outlets. He offers professional writing and seminar services on Linux desktop applications, portable computing, and presentation technology. He’s always interested in covering cool Linux stories. Send him a note or visit his Web site.