Linux.com

Feature

Manipulating PDFs with the PDF Toolkit

By Scott Nesbitt on April 27, 2006 (8:00:00 AM)


Creating and reading PDF files in Linux is easy, but manipulating existing PDF files is a little trickier. Countless applications enable you to fiddle with PDFs, but it's hard to find a single application that does everything. The PDF Toolkit (pdftk) claims to be that all-in-one solution. It's the closest thing to Adobe Acrobat that I've found for Linux.

Developer Sid Steward describes pdftk as the PDF equivalent of an "electronic staple remover, hole punch, binder, secret decoder ring, and X-ray glasses." That's a lot of functionality for a 4MB application, but the software delivers. Pdftk can join and split PDFs; pull single pages from a file; encrypt and decrypt PDF files; add, update, and export a PDF's metadata; export bookmarks to a text file; add or remove attachments to a PDF; fix a damaged PDF; and fill out PDF forms. In short, there's very little pdftk can't do when it comes to working with PDFs.

You can download pdftk 1.12 as source or as a Debian or RPM package, FreeBSD port, or Gentoo Ebuild. Binaries are available for Windows and Mac OS X too. If you decide to compile pdftk, as I did, check the build notes before you begin, in order to find out about any dependencies for your Linux distro or your platform. The compilation process only took a few minutes on my computer, and there were no hitches.

Pdftk is a command-line tool, and the syntax can be complicated, especially for complex actions such as removing specific pages from a PDF file. You can expect to do a lot of typing, but that shouldn't put you off using the tool.

I put pdftk through its paces with a number of PDFs that ranged in size from 30KB to 2MB. I focused on the functions that I use most with other PDF software: joining and splitting PDFs, removing pages from a PDF, and attaching files to a PDF. Except for one or two very minor issues, I wasn't disappointed with the results. Pdftk also produced output far more quickly than most other PDF tools that I've worked with.

Joining files

Pdftk's ability to join two or more PDF files is on par with such specialized applications as pdfmeld and joinPDF (discussed in this article). The command syntax is simple:

pdftk file1.pdf file2.pdf cat output newFile.pdf

cat is short for concatenate -- that is, link together, for those of us who speak plain English -- and output tells pdftk to write the combined PDFs to a new file.
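If you join files often, it may be handier to assemble the command in a script than to retype it. The following is a minimal sketch in Python, not part of pdftk itself; the helper names join_command and run are hypothetical, and the sketch assumes pdftk is installed and on your PATH.

```python
import shutil
import subprocess


def join_command(inputs, output):
    """Build the argument list for: pdftk <inputs...> cat output <output>."""
    if len(inputs) < 2:
        raise ValueError("need at least two PDFs to join")
    # Inputs are concatenated in the order given.
    return ["pdftk", *inputs, "cat", "output", output]


def run(command):
    """Run a pdftk command, failing loudly if the program is missing."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} not found on PATH")
    # check=True raises CalledProcessError if pdftk exits with an error.
    subprocess.run(command, check=True)
```

Calling run(join_command(["file1.pdf", "file2.pdf"], "newFile.pdf")) should be equivalent to typing the command shown above. Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.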

Pdftk doesn't retain bookmarks, but it does keep hyperlinks, both to destinations within the PDF and to external files or Web sites. Where some other applications point hyperlinks to the wrong destinations, the links in the PDFs I combined using pdftk all reached their targets.

Splitting files

Splitting PDF files with pdftk was an interesting experience. The burst option breaks a PDF into multiple files -- one file for each page:

pdftk user_guide.pdf burst

I don't see the use of doing that, and with larger documents you wind up with a lot of files with names corresponding to their page numbers, like pg_0001 and pg_0013 -- not very intuitive.

On the other hand, I found pdftk's ability to remove specific pages from a PDF file to be useful. For example, to remove pages 10 to 25 from a PDF file, you'd type the following command:

pdftk myDocument.pdf cat 1-9 26-end output removedPages.pdf

I have used this syntax extensively to trim pages from work samples that I have posted on my company's Web site, and to extract articles from back issues of a magazine to which I contribute. The resulting files are small, and the PDFs retain excellent resolution.
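Working out which ranges to keep is simple arithmetic, but it is easy to be off by one. Here is a small Python sketch of the calculation; the function names are hypothetical, and it assumes the removed span ends before the last page of the document, so that there is always something left for the final "end" range.

```python
def ranges_to_keep(first_removed, last_removed):
    """Return pdftk page ranges that keep everything except the removed span.

    Pages are 1-based, as in pdftk. The final range uses the "end"
    keyword, so the total page count does not need to be known.
    Assumption: last_removed is not the last page of the document.
    """
    if first_removed < 1 or last_removed < first_removed:
        raise ValueError("invalid page span")
    ranges = []
    if first_removed > 1:
        # Everything before the removed span.
        ranges.append(f"1-{first_removed - 1}")
    # Everything after the removed span.
    ranges.append(f"{last_removed + 1}-end")
    return ranges


def remove_pages_command(source, first_removed, last_removed, output):
    """Build: pdftk <source> cat <ranges...> output <output>."""
    return ["pdftk", source, "cat",
            *ranges_to_keep(first_removed, last_removed),
            "output", output]
```

For the example above, ranges_to_keep(10, 25) gives the two ranges 1-9 and 26-end, and remove_pages_command("myDocument.pdf", 10, 25, "removedPages.pdf") builds the same command line.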

Adding attachments

When I moved to Linux from Windows in 1999, I missed Adobe Acrobat's ability to attach files to a PDF. I regularly used this feature to include addenda, surveys, or additional information with a published PDF. Until I found pdftk, I was forced to move my PDF documents to a Windows box whenever I needed to attach a file.

Why attach a file to a PDF instead of sending an archive? The major appeal is convenience. If you move a PDF from one computer to another, and don't move the archive along with it, you won't have access to the attachments. And instead of pulling a file from an archive to view it, you just double-click on the attachment's icon to open the file from your PDF viewer.

Pdftk can attach binary and text files to a PDF with ease. You can even specify what page of the PDF you want the attachment to appear on. For example:

pdftk html_tidy.pdf attach_files command_ref.html to_page 24 output html_tidy_book.pdf

I have attached OpenOffice.org Writer documents, tar.gz archives, and text and HTML files to various PDF documents, and aside from a noticeable increase in the size of the PDF file, there were no nasty side effects.

Attached files are denoted by a thumbtack icon in the PDF, but only in Adobe's Acrobat Reader. Attachments don't appear in Xpdf, Evince, KPDF, or gv.

Filling out forms

Most PDF files are static -- you read them, print them out, or copy text from them. But PDFs can also be interactive. It's possible to create PDF forms with fields that accept information. Companies and government departments post PDF forms on their Web sites to collect survey information and customer feedback, and even to submit tax returns.

Using pdftk's fill_form option, you can fill out forms using information in a separate file. However, the fill_form option isn't for the faint of heart. To perform this task, you need to create a Form Data Format (FDF) file containing the data that you want to merge into the form. You can do this using pdftk's generate_fdf directive.

The FDF file contains the name of each field in the PDF and the value you want to enter into that field. The FDF file also names the PDF form that the data belongs to. An FDF file looks something like this:

%FDF-1.2
1 0 obj
<< /FDF
  << /Fields
   [ << /T (Name_field) /V (Fred Langan) >>
     << /T (Address_field) /V (1313 Mockingbird Lane) >>
     << /T (Age_field) /V (53) >>]
     /F (info_form.pdf)
  >>
>>
endobj
trailer
<< /Root 1 0 R >>
%%EOF

To fill out the form using an FDF file, use a command like this:

pdftk survey_form.pdf fill_form survey_answers.fdf output filled_survey.pdf

Unless you're comfortable creating FDF files, the fill_form option isn't really suited for completing the odd form here and there. However, if you're feeling adventurous, the book PDF Hacks explains how to use pdftk and a Web server running PHP to do this with Web-based forms.
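If writing FDF by hand is the sticking point, a short script can produce it from a set of field names and values. The following Python sketch writes text laid out like the example above. The function names are hypothetical, the field names must match those in your form, and the sketch assumes plain ASCII values; it has not been checked against forms that need other encodings.

```python
def escape_fdf_string(text):
    """Escape backslashes and parentheses, which delimit PDF literal strings."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def build_fdf(fields, form_name):
    """Return FDF text for the given fields.

    fields maps field names to values; form_name is the PDF form
    that the data belongs to.
    """
    # One << /T (name) /V (value) >> entry per field.
    entries = "\n     ".join(
        f"<< /T ({escape_fdf_string(name)}) /V ({escape_fdf_string(str(value))}) >>"
        for name, value in fields.items()
    )
    return (
        "%FDF-1.2\n"
        "1 0 obj\n"
        "<< /FDF\n"
        "  << /Fields\n"
        f"   [ {entries} ]\n"
        f"     /F ({escape_fdf_string(form_name)})\n"
        "  >>\n"
        ">>\n"
        "endobj\n"
        "trailer\n"
        "<< /Root 1 0 R >>\n"
        "%%EOF\n"
    )
```

Writing the result of build_fdf({"Name_field": "Fred Langan", "Age_field": 53}, "info_form.pdf") to a file gives you something you can pass to fill_form in place of a hand-written FDF file.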

A couple of infrequently used options

Pdftk has a number of options that you might use infrequently, but that are very useful when you need them -- such as update_info and user_pw.

A newly created PDF might contain incomplete metadata (information describing the PDF) or none at all. Metadata can come in handy when you or your users need to organize or index a set of PDF files. Using pdftk and a text file, you can change or add metadata to the PDF:

pdftk DocBook_Overview.pdf update_info data.txt output DocBookOverview.pdf

In this usage, the contents of the file data.txt consist of an InfoKey and InfoValue pair, like this:

InfoKey: Keywords
InfoValue: DocBook,writing,documentation,background

You can change only the following metadata items with pdftk: title, author, subject, producer, and keywords.
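To set several items at once, the data file can simply hold one InfoKey and InfoValue pair after another. The Python sketch below builds such a file from a dictionary. The function name is hypothetical, and the capitalized key names are an assumption based on the Keywords example above and on the usual names in a PDF's information dictionary.

```python
# Key names are an assumption: capitalized as in the "Keywords" example.
ALLOWED_KEYS = ("Title", "Author", "Subject", "Producer", "Keywords")


def build_info_data(metadata):
    """Return update_info text with one InfoKey/InfoValue pair per entry."""
    lines = []
    for key, value in metadata.items():
        if key not in ALLOWED_KEYS:
            raise ValueError(f"unsupported metadata key: {key}")
        if "\n" in value:
            # Each value sits on a single InfoValue line.
            raise ValueError("values must fit on one line")
        lines.append(f"InfoKey: {key}")
        lines.append(f"InfoValue: {value}")
    return "\n".join(lines) + "\n"
```

Saving the output of build_info_data({"Title": "DocBook Overview", "Keywords": "DocBook,writing"}) as data.txt yields a file in the same format as the one shown above, ready to pass to update_info.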

If you're working with PDFs that contain sensitive information and want to make sure that only certain people can read them, you can apply a password with the user_pw option:

pdftk sales_report.pdf output SalesReport.pdf user_pw PROMPT

You will be prompted for a password of up to 32 characters. When someone tries to open the PDF, they will be asked to enter a password.

If you use pdftk regularly, or if you're comfortable writing scripts to encapsulate the commands that you use, then you should have no problems working from the command line. Otherwise, check out Dirk Paehl's graphical front end for pdftk, GUI for PDFTK. It isn't the prettiest or most intuitive GUI around, but it does give you quick access to all of pdftk's functions.

Conclusion

Pdftk is one of the most useful tools for manipulating PDF files. It does as good a job as the single-function PDF tools available for Linux, and often the results are better.

Pdftk's flexibility is unmatched on Linux. While it's not the easiest software, with a bit of practice you'll get the hang of it. The pdftk Web site contains a number of useful tips and tricks.

Chances are you'll use only a handful of pdftk's features regularly. But when you need to call on some of pdftk's other functions, for things like repairing a PDF file or filling out PDF forms, you'll be glad you have this application on your hard drive.

Scott Nesbitt is a freelance journalist and technical writer based in Toronto, Canada, who spends way too much time fooling around with PDFs (and other types of documents).


Comments

on Manipulating PDFs with the PDF Toolkit


Files for guipdftk?

Posted by: Anonymous Coward on April 27, 2006 09:10 PM
Your article was great and I have used pdftk for some time myself. The gui sounds useful but the link you gave seems dead. If you could post a linux version somewhere it would be much appreciated. Debian would be even better ;-)

#

pdf split

Posted by: Anonymous Coward on April 27, 2006 11:24 PM
a great program. Using it for quite a while on OSX Panther and Tiger in production environment. Works extremely well and handles correctly spot colors and fonts. It has an option not only to burst the output but also to rename it correctly.

pdftk myfile.pdf burst output myNEWname_%03d.pdf

this would burst the file and output in a new name with keeping always 3 spaces for number - like 003, 033, 133 so they are in correct order on the sort.

bojidar

#

Can it split a pdf?

Posted by: Anonymous Coward on April 28, 2006 06:21 AM
The biggest gap in pdftk seems to be the simple one of splitting a pdf into two at a particular page. At least, I haven't been able to find a way of doing this.

#

Re:Can it split a pdf?

Posted by: Anonymous Coward on April 28, 2006 10:00 AM
Well, there's no single command that I'm aware of to do this, as pdftk likes to output a single file. However, two simple commands will do the trick:


    pdftk book.pdf cat 1-4 output part1.pdf

    pdftk book.pdf cat 5-end output part2.pdf


    --kirby

#

Re:Can it split a pdf?

Posted by: Anonymous Coward on April 28, 2006 08:39 PM
Thanks, Kirby. That's useful to know. I suppose there are benefits in keeping the original file intact.

#

Page numbers

Posted by: Anonymous Coward on April 28, 2006 09:45 PM
Can anyone suggest a way of adding page numbers to pdf document (preferably on Linux if any)

#

Re:Page numbers

Posted by: Anonymous Coward on May 08, 2006 03:59 AM
Yes, you can add page numbers to a PDF with mbtPdfAsm:
http://thierry.schmit.free.fr/dev/mbtPdfAsm/enMbtPdfAsm2.html

Juanan

#

Uncompress is handy, too

Posted by: Anonymous Coward on April 29, 2006 12:02 AM
I found the uncompress feature to be handy, since I can then do funky things to the text. I used it to hack together a program to extract single italic characters from the Da Vinci Code court ruling yesterday.

#

Missing page rotations

Posted by: Anonymous Coward on April 29, 2006 07:33 PM
I miss an option to rotate a page in PDF document by 90, 180 or 270 degrees. Our printer/copier can create PDF files and I need often to rotate resulted PDF document in 90, 180 or 270 degrees as original documents were scanned with wrong orientation. I can rotate page in PDF reader but I cannot store the changes to PDF document.

#

Re:Missing page rotations

Posted by: Anonymous Coward on May 11, 2006 12:42 AM
This one had me stumped for a while. Finally solved it using sed.

Uncompress the document:
pdftk input.pdf output output.pdf uncompress
Replace all Rotate 0 entries with Rotate 90, 180, or whatever:
sed -i "s/Rotate 0/Rotate 90/g" output.pdf
Recompress the document if desired:
pdftk output.pdf output final.pdf compress
All done. If you want to only rotate a certain page or pages, you'll need to extract those pages to a separate pdf first, rotate those, then recombine.

I feel really clever right now, hope this helps someone.



Niosop

#

Re:Missing page rotations

Posted by: Anonymous Coward on May 18, 2006 10:04 PM
dude u rock

#

Re:Missing page rotations

Posted by: Anonymous Coward on June 06, 2006 11:46 PM
And Google said: "Search an ye shall receive...."
Thanks!

#

Re:Missing page rotations

Posted by: Administrator on August 18, 2006 04:04 AM
I tried this with PDF files I get via an E-mail FAX service. Unfortunately there are not "Rotate" statements in the file even after uncompressing.

Is there possibly another method, without extracting the images, rotating and rebuilding the PDF?

#

Re:Missing page rotations

Posted by: Anonymous Coward on August 29, 2006 12:04 AM
You could probably just add the Rotate statements in. You'd have to figure out where it wants them though. I'd see if you can find a PDF that does have Rotate statements in it, and try and find a pattern as to where rotate statements are allowed. Since it's a FAX to PDF conversion, I think a single Rotate statement would probably work, and the format is most likely consistent between faxes, so it would be easy enough to script.

But extracting the image, rotating and rebuilding the PDF should be really easy w/ a quick script as well, were you having image quality problems when doing it this way?

Niosop

#

Re:Missing page rotations

Posted by: Administrator on August 29, 2006 07:14 AM
Ok, I figured out how to rotate FAXes I receive as PDFs. The FAX itself is an image inside the PDF and the files I receive from the provider don't contain any /Rotate statements.

After a lot of trial and error I found the solution to reversing a FAX that came in upside down. Locate lines like this: /MediaBox [0 0 609 734]

Change it to: /MediaBox [0 0 609 734] /Rotate 180

This should appear once per page.

#

Re:Missing page rotations

Posted by: Anonymous Coward on April 12, 2007 05:48 PM
Try:
pdftk in.pdf cat 1E output out.pdf
to rotate 90 degrees (or 1S for 180 or 1W for -90)

see:
http://www.pdfhacks.com/pdftk/

#

Re:Missing page rotations

Posted by: Administrator on April 12, 2007 11:26 PM
I tried this on a PDF of a 12-page FAX that someone sent upside down. When I did the "pdftk in.pdf cat 1S output out.pdf" to rotate it 180 degrees it successfully rotated the first page but the out.pdf file only contained 1 of the 12 pages that were in in.pdf

#

Re:Missing page rotations

Posted by: Administrator on April 13, 2007 12:29 AM
After experimenting a little I was able to get it to work. The trick is that the 1 in 1S specifies the page number. I was able to have all 12 pages rotated with the following command: pdftk in.pdf cat 1-12S output out.pdf

#

Re(1):Missing page rotations

Posted by: Anonymous [ip: 130.188.8.11] on October 23, 2007 08:05 AM
I guess you didn't catch this:
pdftk in.pdf cat 1-endS output out.pdf

so you don't need to know how many pages the document has.

#

WOW!!!!

Posted by: Anonymous [ip: 146.83.221.115] on January 11, 2008 03:42 PM
I'm New in LINUX, but it really seems to work... I have all i need (and i use my computer very intensively) Thank you a lot for guys like you...

#

Manipulating PDFs with the PDF Toolkit

Posted by: Anonymous [ip: 202.88.34.110] on January 15, 2008 03:31 PM
plzz tell me how can i use pdftk with php on LINUX. I used it on Windows as folows to auto fill pdf's like this :

passthru( 'E:\Projects\PHPProjects\PDF\pdftk S-Corp-Blank.pdf fill_form flatfile.fdf output abc.pdf flatten') ;

Im a newbie to linux plz tell me what path do i give instead of E:\Projects\PHPProjects\PDF\pdftk on linux to invoke pdftk using passthru in php , caz i have read that passthru requires complete path of the application in order to run it .

thanx ,
waiting for reply ,
Fahad.

#

splitting a single page into two?

Posted by: Anonymous [ip: 221.188.58.243] on February 05, 2008 03:31 PM
Let's say a PDF comprises A4 and A3 pages, and I want to split each A3 page into two A4 pages so that I can print it unreduced. Is such functionality possible? Thanks, Gernot

#




 