March 1, 2007

Using squidGuard for content filtering

Author: Keith Winston

Content filtering for the Web can be a messy proposition. A business may need to block only the most objectionable Web sites, while schools may be required by law to follow a more thorough process. Whatever your needs, you can build a solution with only open source pieces: squid, squidGuard, and blacklists.

The squid server acts as an intermediary between a Web browser and Web server. As a proxy, it receives a URL request from the browser, connects to the server on behalf of the browser, downloads content, then provides it to the browser. It also saves the content to disk so it can provide it more quickly to another browser if the same URL is requested in the near future. Generally, this leads to more efficient utilization of an Internet connection and faster response times for Web browsers.

A typical hardware setup uses two physical network cards on the proxy server. One connects to the internal network, where squid listens for incoming HTTP requests on the default port 3128. The other connects to the Internet, from which it downloads content.

Squid is available for most Linux distributions as a standard package. I was able to get squid running on Red Hat Linux with sane defaults by simply installing the RPM and setting a few options in the /etc/squid/squid.conf configuration file:

visible_hostname your-server-name
acl our_networks src 192.168.0.0/16
http_access allow our_networks
http_access deny all

The visible_hostname tells squid the name of the server. The acl is an access control list used in the http_access rule to allow internal clients to connect to squid. For security reasons, it is important to ensure that users outside your network can't use squid; this is achieved by adding a deny rule near the bottom of your configuration.

Tell the browsers

Most Web browsers behave a little differently when they know they are talking to a proxy server. In Firefox 2.0, you enter proxy settings under Tools -> Options (Firefox -> Preferences on Mac) -> Advanced section -> Network tab, then click the Settings button under Connection.

[Screenshot: Firefox proxy settings]

Once the browser is configured, it should make requests and get responses from squid.

Another way to use squid is in transparent proxy mode. Transparent proxies are often used to force Web traffic through the proxy regardless of how each browser is configured. Doing so requires some network trickery to hijack outgoing HTTP requests and also requires additional tweaks to squid. You can read useful guides for configuring squid as a transparent proxy elsewhere.
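As a sketch of that trickery (not part of this article's setup): with squid 2.6 or later, interception is enabled on the proxy's http_port, and outbound port 80 traffic is hijacked with an iptables rule on the gateway. The interface name and ports here are assumptions about your layout.

```
# /etc/squid/squid.conf (squid 2.6+): accept intercepted traffic
http_port 3128 transparent

# On the gateway: redirect outbound HTTP from the internal interface
# (eth0 is an assumption) to the squid port
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 \
         -j REDIRECT --to-port 3128
```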

Redirectors

With no additional configuration, squid faithfully fetches and returns each URL requested of it. To filter the content, squid has a feature called a redirector -- a separate program called by squid that examines the URL and tells squid to either proceed as usual or rewrite the URL so squid returns something else instead. Most often, redirectors rewrite banned URLs, returning the URL of a custom error page that explains why the requested URL was not honored.
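The interface is simple enough to illustrate with a toy redirector (this is a sketch, not squidGuard itself): squid writes one request per line to the redirector's stdin in the form "URL client-ip/fqdn ident method", and the redirector answers with a replacement URL, or a blank line to leave the request untouched. The domain and blocked-page URL below are placeholders.

```shell
# Toy redirector: block one placeholder domain, pass everything else.
redirector() {
    while read url rest; do
        case "$url" in
            # Banned URL: answer with the error page instead
            *badsite.example*) echo "http://webserver.com/blocked.html" ;;
            # Anything else: blank line means "proceed as usual"
            *) echo ;;
        esac
    done
}
```

Feeding it a request line shows the rewrite in action; squid does exactly this over a pipe for each URL it receives.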

Several third-party redirectors have been written, including squirm and squidGuard. Both are C programs that must be compiled from source. Squirm operates on regular expression rules, while squidGuard uses a database of domains and URLs to make decisions. I have not done any performance testing on redirectors, but squidGuard has a reputation for scaling well as the size of its blacklist increases. In my experience, squidGuard has performed well on networks with up to a thousand users.

Installing squidGuard 1.2.0

The squidGuard redirector is installed using the familiar "configure, make, make install" routine. One requirement that may not be installed on your system is the Berkeley DB library (now owned by Oracle), which squidGuard uses to store blacklist domains and URLs.

After running make install using the squidGuard source, I discovered that some directories were not created. I manually created the following directories:
/usr/local/squidGuard/ -- for configuration files
/usr/local/squidGuard/log/ -- for log files
/usr/local/squidGuard/db/ -- for blacklist files
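The fix-up can be sketched with a shell variable so the prefix is easy to change. Here the prefix points at a scratch directory so the commands are safe to try; substitute /usr/local/squidGuard (as root) on a real install.

```shell
# Stand-in prefix; use /usr/local/squidGuard on a real system
prefix=$(mktemp -d)

# mkdir -p creates the parent along with the log and db subdirectories
mkdir -p "$prefix/log" "$prefix/db"
```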

Next, copy the sample configuration file to /usr/local/squidGuard/squidGuard.conf. We'll come back to the squidGuard configuration shortly.

To make squid aware of squidGuard, add these options to /etc/squid/squid.conf:

redirect_program /usr/local/bin/squidGuard -c /usr/local/squidGuard/squidGuard.conf
redirect_children 8
redirector_bypass on

The redirect_program option points to the redirector binary and configuration file. The redirect_children option controls how many redirector processes to start. The redirector_bypass option tells squid to ignore the redirector if it becomes unavailable for some reason. If you do not set this option and squidGuard crashes or gets overloaded, squid will quit with a fatal error, perhaps ending all Web access.

Using a blacklist

To be effective as a filter, squidGuard needs a list of domains and URLs that should be blocked. Building and maintaining your own blacklist would require a huge investment in time. Fortunately, you can download a quality list and refresh it as it gets updated. One of the largest and most popular blacklists is maintained by Shalla Security Services.

The Shalla list contains more than one million entries categorized by subject, such as pornography, gambling, and warez. You can use all or any part of the list. The list is free for noncommercial use. For commercial use, a one-page agreement needs to be signed and returned to Shalla, but there is no cost to use the list unless it is embedded and resold in another product. Additional free and non-free blacklists are available, but the Shalla list is a good place to start.

To use it, download and unpack it in a temporary directory. It will create a directory called BL with subject subdirectories below. Copy the directory tree below BL to the /usr/local/squidGuard/db/ directory. When you are done, the db directory should contain the subject subdirectories.
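The unpack-and-copy steps look something like the following. On a real system you would fetch shallalist.tar.gz from Shalla first; here a tiny stand-in archive is built so the commands are safe to try, and the db path points at a scratch directory instead of /usr/local/squidGuard/db.

```shell
# Build a miniature stand-in for the downloaded archive
work=$(mktemp -d)
cd "$work"
mkdir -p BL/spyware
printf 'badsite.example\n' > BL/spyware/domains
tar czf shallalist.tar.gz BL
rm -r BL                        # pretend we start from just the archive

# The actual steps: unpack, then copy the subject tree into db/
tar xzf shallalist.tar.gz       # creates the BL directory tree
dbdir="$work/db"                # substitute /usr/local/squidGuard/db
mkdir -p "$dbdir"
cp -r BL/* "$dbdir"/
```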

The blacklist itself is a set of plain text files named domains and urls. To allow squidGuard to use them, the text files must be loaded into Berkeley DB format. Before running the conversion process, return to the squidGuard.conf file and define which files you want to use.

The following is a basic squidGuard.conf configuration:

#
# CONFIG FILE FOR SQUIDGUARD
#
dbhome /usr/local/squidGuard/db
logdir /usr/local/squidGuard/log

# DESTINATIONS
dest spy {
        domainlist spyware/domains
        urllist spyware/urls
        log /usr/local/squidGuard/log/blocked.log
}

# ACCESS CONTROL LISTS
acl {
        default {
                pass !spy !in-addr all
                redirect http://webserver.com/blocked.html
        }
}

The dest block defines lists of domains and URLs, used later in the access control section. The example defines a "spy" destination using the spyware blacklist files defined with relative paths to the files in the db directory. It also uses the log option to write records to the blocked.log file when a match is found. The name and location of the log file can be changed.

The acl block defines what squidGuard does with requests passed to it from squid. The example instructs squidGuard to allow all requests that do not match the "spy" destination and are not IP addresses. The redirect option defines what URL to return if a request does not pass. So, if a request matches our blacklist, it gets redirected to the blocked.html page. It is also possible to set up a CGI script that can collect and report additional information, such as the user, source IP, and URL of the request.
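squidGuard supports substitution variables in the redirect URL for exactly this purpose; a CGI-style redirect might look like the fragment below (the script path is a placeholder, and %a and %u expand to the client address and the requested URL):

```
redirect http://webserver.com/cgi-bin/blocked.cgi?clientaddr=%a&url=%u
```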

The squidGuard configuration can be arbitrarily complex. I recommend starting out with a simple configuration and slowly adding to it and testing it until it meets your requirements.

Returning to the blacklist, it is time to run the Berkeley DB load process, using squidGuard to create the database files. This command starts the conversion process:

 /usr/local/bin/squidGuard -C all

With this command, squidGuard looks at its configuration file and converts the files defined. In the example, it would only convert the spyware lists, creating the files spyware/domains.db and spyware/urls.db. The loading process can take a while, especially on older hardware.

I ran into an issue with file permissions on the blacklist databases. If the files did not have permissions of 777, squidGuard was not able to use them. Even though the squidGuard processes ran as user squid and the files were owned by user squid with permissions of 755, squidGuard did not work as expected. In my setup, this was not a big problem because squidGuard was running on a standalone firewall. However, on a multi-user system, it would be a serious concern.
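Applying that workaround across the generated .db files can be sketched as follows. This points at a scratch copy so it is safe to try; on a real install, use /usr/local/squidGuard/db and run it as root. World-writable files are a security trade-off, acceptable here only because the box was a standalone firewall.

```shell
# Stand-in for /usr/local/squidGuard/db with a couple of dummy .db files
dbdir=$(mktemp -d)
mkdir -p "$dbdir/spyware"
touch "$dbdir/spyware/domains.db" "$dbdir/spyware/urls.db"

# The article's workaround: open up every generated database file
find "$dbdir" -type f -name '*.db' -exec chmod 777 {} +

# chown -R squid:squid "$dbdir"   # also sensible; needs root on a real system
```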

Using a whitelist

There are a couple of approaches to setting up a whitelist. One option is to create a whitelist directory under the squidGuard db directory and manage the whitelist using squidGuard ACLs. Another option is to create a file, such as /etc/squid/whitelist, and manage the exceptions with squid. Both options are effective, but I decided to manage the exceptions in squid for two reasons: it would eliminate a call to squidGuard, and it would be faster to modify. If the whitelist were maintained by squidGuard, squid would have to be restarted to make the changes active. With the whitelist maintained by squid, a much faster squid reload (re-reading the configuration file) is all that is required.

To configure the whitelist in squid, two extra options are needed in /etc/squid/squid.conf:

acl white dstdomain "/etc/squid/whitelist"
redirector_access deny white

The first option defines an access control list of destination domains read from the whitelist file. The whitelist file contains domain names (e.g., .youtube.com), one per line. The second option tells squid to skip the call to squidGuard when the requested URL matches the whitelist. The options must be defined in the order shown; the ACL must be defined before it is used.
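A minimal /etc/squid/whitelist might look like this (the domains are examples; a leading dot matches the domain and all of its subdomains):

```
.youtube.com
.wikipedia.org
```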

Debugging and tuning

Both squid and squidGuard create useful log files. The primary squid log file is /var/log/squid/cache.log. Squid is very clear when certain problems arise with the redirector. For example, these messages appeared in the squid log during my first full day of production using squidGuard:

WARNING: All redirector processes are busy.
WARNING: 5 pending requests queued
Consider increasing the number of redirector processes in your config file.

The setting in squid.conf for the number of redirectors is redirect_children, so correcting this was straightforward. Other issues may be more subtle.

Squid provides excellent internal diagnostic reports through squidclient, a program included with the squid package. Use the following command on the machine where squid is installed to get general statistics:
squidclient mgr:info

Use this command to see a report on the performance of the redirectors:
squidclient mgr:redirector

When squidGuard has a problem, its errors may not be as precise. A common message in the squidGuard log is "going into emergency mode." There may be additional helpful messages in the log file, but emergency mode usually means that squidGuard has stopped working. Often the cause is a syntax error in the configuration file, but it could be a permissions issue or something else. You can test a squidGuard configuration from the command line before committing changes: feed a list of URLs to squidGuard, using your test configuration file, and see if it returns the expected results. A blank line means squidGuard did not change the URL, while any other result means the URL was rewritten.
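Squid hands the redirector one request per line in the form "URL client-ip/fqdn ident method", so a manual test looks something like this (the URL and the config file path are placeholders for your own):

```
echo "http://badsite.example/page 127.0.0.1/- - GET" | \
    /usr/local/bin/squidGuard -c /usr/local/squidGuard/test.conf
```

If the output is the blocked-page URL, the blacklist matched; a blank line means the request would pass through unchanged.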

The long arm of the squid

Squid and squidGuard offer a reliable, fast platform for Web content filtering. If squidGuard doesn't meet your needs, additional redirectors are available, or you can roll your own. In addition to blacklisting, the redirector interface can be used to remove advertising, replace images, and do other creative things. Content filtering with squid can be as coarse or as fine-grained as your needs.