July 8, 2008

Protecting against evil code fragments with HTML Purifier

Author: Ben Martin

HTML Purifier is a project that helps you ensure that HTML is valid and does not contain cross-site scripting attempts or other nasty attacks. With HTML Purifier you can allow users to post HTML content without letting them insert malicious code that will run in the browser of anyone viewing that HTML. An assortment of plugins lets you use HTML Purifier with CodeIgniter, Drupal, MODx, Phorum, Joomla!, and WordPress. To get an idea of the cleanups that HTML Purifier can perform, head over to the demo page.

HTML Purifier uses a whitelist approach to security, where all parts of a valid HTML document must be explicitly permitted, rather than a blacklist that looks for known nasty HTML code. The smoke test page explicitly lists which things are permitted and in what context. One aim of HTML Purifier is that it should fully understand what valid HTML is, which elements can be nested in others, and what is valid content for each attribute attached to a particular element. HTML Purifier also includes support for CSS and can do things like automatically turning text prefaced with http:// into proper HTML links (anchor elements with an href attribute). If you are already using an HTML validation tool, take a look at the project's comparison page to see whether HTML Purifier would make a good replacement.
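The difference between the two approaches is easy to see in a toy sketch. The lists and function names here are made up for illustration and say nothing about how HTML Purifier is implemented internally; the point is only that an element nobody thought to blacklist slips through, whereas a whitelist drops it by default.

```php
<?php
// Toy illustration only; the lists and function names are hypothetical.
$blacklist = array('script', 'iframe');   // elements known to be dangerous
$whitelist = array('p', 'b', 'i', 'a');   // elements explicitly permitted

// Blacklist: anything not known to be bad is let through.
function passes_blacklist($tag, array $blacklist)
{
    return !in_array(strtolower($tag), $blacklist, true);
}

// Whitelist: anything not known to be good is rejected.
function passes_whitelist($tag, array $whitelist)
{
    return in_array(strtolower($tag), $whitelist, true);
}

foreach (array('b', 'script', 'object') as $tag) {
    printf("%-6s blacklist: %-4s whitelist: %s\n",
        $tag,
        passes_blacklist($tag, $blacklist) ? 'pass' : 'drop',
        passes_whitelist($tag, $whitelist) ? 'pass' : 'drop');
}
```

In this sketch the object element passes the blacklist simply because it was never listed, while the whitelist drops it.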

There are no packages of HTML Purifier for Ubuntu, Fedora, or openSUSE. HTML Purifier can be installed using PEAR, which not only gets it installed quickly but also allows you to easily move to the latest version using pear upgrade. PEAR makes including HTML Purifier in your scripts simpler because you do not need to specify any path in your script. There are also three different tarballs offered for those who want to install HTML Purifier manually from sources. Two of them contain the code with and without documentation, and the third includes all the dependencies you need.

To install HTML Purifier through PEAR you must first install the php-pear package, then use the pear command to install HTML Purifier. The commands below will install HTML Purifier at /usr/share/pear/HTMLPurifier.

pear channel-discover htmlpurifier.org
pear install hp/HTMLPurifier

For me, trying to use HTML Purifier at this stage failed with an error in the Apache log files about the Cache.SerializerPath path not existing. HTML Purifier tried to use /usr/share/pear/HTMLPurifier/DefinitionCache/Serializer as a writable path for caching content. The cache can be turned off as detailed in the INSTALL file, or you can create the directory in /usr that HTML Purifier wants to use as a volatile cache, or create a new directory in /var to handle the cached data. The third option is shown below:

# mkdir -p /var/cache/HTMLPurifier
# chown apache /var/cache/HTMLPurifier
# chmod o-rwx /var/cache/HTMLPurifier
# ls -ld /var/cache/HTMLPurifier
drwxr-x--- 2 apache root 4096 2008-06-25 14:25 /var/cache/HTMLPurifier

Unfortunately the default path for the SerializerPath is encoded in HTMLPurifier/ConfigSchema/schema.ser, which is a length-delimited file that is not very human-edit-friendly. The best solution is to use a configuration object in your PHP code to change the path, or better yet, your own PHP function that sets up the configuration object for your Web site.
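A minimal sketch of such a function might look like the following. The function name mysite_purifier_config is my own invention, and the cache path assumes the /var/cache/HTMLPurifier directory created above; the calls themselves are the same ones used in the rest of this article.

```php
<?php
require_once 'HTMLPurifier.auto.php';

// Hypothetical helper; the name mysite_purifier_config is an assumption.
// Keeping the settings in one place means every page gets the same policy.
function mysite_purifier_config()
{
    $config = HTMLPurifier_Config::createDefault();
    // Point the definition cache at a writable directory outside /usr.
    $config->set('Cache', 'SerializerPath', '/var/cache/HTMLPurifier');
    return $config;
}

$purifier = new HTMLPurifier(mysite_purifier_config());
```

Any page on the site can then include this file and use $purifier without repeating the cache path.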

Below is a simple index.php file that uses HTML Purifier to clean up HTML content that is submitted via a form supplied on the same HTML page. Note that the call to htmlspecialchars is not used for security, but simply to enable the HTML text entered by the user to be fully seen within the pre element.

# cd /var/www/html
# mkdir HTMLPurifierTest
# chown ben:apache HTMLPurifierTest
# chmod +s HTMLPurifierTest
# su -l ben
$ cd /var/www/html/HTMLPurifierTest
$ vi index.php

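<?php
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('Core', 'Encoding', 'ISO-8859-1');
$config->set('HTML', 'TidyLevel', 'heavy');
$config->set('Cache', 'SerializerPath', '/var/cache/HTMLPurifier');
$purifier = new HTMLPurifier($config);

// Read the submitted text explicitly instead of relying on register_globals.
$query = isset($_GET['query']) ? $_GET['query'] : '';
?>
<html>
<body>
<p>Enter your nastiest HTML below!</p>

<form name="myform" action="index.php">
<input type='text' name='query'></form>

<p>This is the clean part of what you said...</p>
<pre>
<?php
// Purify the input, then escape it so the cleaned markup is shown as text.
$clean_html = $purifier->purify($query);
print htmlspecialchars($clean_html);
?>
</pre>
</body>
</html>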

If you wish to explicitly limit the HTML elements that a user can enter, use the ForbiddenElements configuration directive as shown below. This example will strip out any bold, italic, or preformatted tags from the HTML entered. You can also go the other way and explicitly whitelist which elements are valid using AllowedElements.

$config->set('HTML', 'ForbiddenElements', 'b,i,pre');
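For comparison, here is a sketch of the whitelist direction using AllowedElements. The particular list of elements is an arbitrary choice of mine, and I am assuming the directive accepts the same comma-separated form as ForbiddenElements; check the configuration documentation for your version before relying on it.

```php
<?php
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('Cache', 'SerializerPath', '/var/cache/HTMLPurifier');
// Only these elements are kept; the list itself is an arbitrary example.
$config->set('HTML', 'AllowedElements', 'p,b,i,a');
$purifier = new HTMLPurifier($config);
```

A short whitelist like this tends to suit comment forms, where there is little reason to accept tables or images at all.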

HTML Purifier includes support for filtering and mangling URIs both before and after the main validation. Filtering before the HTML input is validated lets you turn URIs that are not valid into ones that are, so that HTML Purifier does not reject them. For example, if you are allowing the user to link to images or other media files, then you might just pass in the unique identifier of the image and have a custom URI mangler substitute these custom URIs with real absolute HTTP URLs.
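The substitution step itself can be sketched in plain PHP. This is not HTML Purifier's URI filter interface, only the mapping such a mangler would perform; the media: prefix, the function name, and the base URL are all assumptions made up for the example.

```php
<?php
// Hypothetical mapping; the "media:<id>" form and the base URL are
// invented for this example and are not part of HTML Purifier.
function expand_media_uri($uri, $baseUrl = 'http://www.example.com/media/')
{
    // Rewrite only URIs of the exact form media:<digits>.
    if (preg_match('/\Amedia:(\d+)\z/', $uri, $matches)) {
        return $baseUrl . $matches[1];
    }
    // Anything else is returned unchanged for normal validation.
    return $uri;
}

echo expand_media_uri('media:42'), "\n";
echo expand_media_uri('http://www.example.org/a.png'), "\n";
```

Keeping the pattern strict means a malformed identifier is passed through untouched and left for the normal URI validation to reject.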

One URI filter is the host blacklist, which lets you block given host names. Be careful using the host blacklist, because a URL is rejected if anything you blacklist appears anywhere in its host name, not just at the end. Luckily the code for the host blacklist class is short, so you could easily define a class that tests only for host names ending with a given suffix. There are more such URI filters listed as possibly coming soon.
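The difference between a substring test and a suffix test is sketched below. Both helper functions are hypothetical and are not part of HTML Purifier; a custom filter class could call something like the second one from its own matching logic.

```php
<?php
// Hypothetical helpers for illustration; neither is part of HTML Purifier.

// Substring test: true if the fragment appears anywhere in the host name.
function host_contains($host, $fragment)
{
    return strpos(strtolower($host), strtolower($fragment)) !== false;
}

// Suffix test: true only for the domain itself or one of its subdomains.
function host_ends_with($host, $domain)
{
    $host = strtolower($host);
    $domain = strtolower($domain);
    if ($host === $domain) {
        return true;
    }
    $suffix = '.' . $domain;
    return substr($host, -strlen($suffix)) === $suffix;
}

foreach (array('www.spam.com', 'notspam.com', 'spam.com.example.org') as $host) {
    printf("%-22s substring: %-3s suffix: %s\n",
        $host,
        host_contains($host, 'spam.com') ? 'yes' : 'no',
        host_ends_with($host, 'spam.com') ? 'yes' : 'no');
}
```

The leading dot in the suffix is what stops notspam.com from being treated as part of spam.com.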

Wrap up

Presumably the installation issues with the cache directory are a limitation of PEAR. At least having HTML Purifier fail to work while producing a nice verbose error message forces the issue of where to store volatile cache documents rather than just silently using a path under /usr.

HTML Purifier offers protection against people entering nasty HTML code instead of well-formed HTML fragments in your Web forms. The ability to enter whitelists for things like which HTML elements can be used along with URI filtering should also take the fun out of users trying to explicitly enter invalid data into your forms. URI filtering is a great option for helping with forum spam if you are allowing anonymous forum posts. For example, you could enforce a policy that allows people to post only links to your own site when they post anonymously, and if they want to link to another site then they have to register first.

