April 29, 2008

Automatically watching Web sites for changes

Author: Ben Martin

If you want to be notified when and how a Web site has changed, you can turn to either netstiff or urlwatch to keep an eye on things for you. Both of these tools monitor Web sites for changes and show you diff-like output of exactly what has changed. You can also use netstiff to monitor FTP sites for changes.

I couldn't find any packages of netstiff or urlwatch for mainstream Linux distributions, though there is a .deb file available through netstiff's freshmeat page. Netstiff is written in Ruby and urlwatch in Python, so neither requires compilation. Netstiff comes with a simple Makefile with a make install target that will install the software on the system, or you can just run it from where you expanded the archive. Urlwatch does not include an install target in its Makefile; you must either install it by hand or use it from its expanded archive directory.

For demonstration purposes I used linux.com as the Web site to test these tools against. Because I can't arbitrarily change the features on the linux.com home page, I made a local snapshot of the site and made changes on the local mirror (http://localhost/www.linux.com-snapshot/index.html) to see how the tools detected the changes.


The first time you run netstiff it detects that you have no configuration file and displays a menu that allows you to set up URLs to watch. For each URL that you add via the menu you can specify the user-agent, referer, language, timeout, proxy, and a few other options used when fetching data from that URL. One useful option is a start and end regular expression (regex). The start regex allows you to throw away anything in the HTML obtained from the URL that precedes the regex. This is handy for trimming off a banner that appears at the top of a Web page, as you'll see in the example below. In a similar manner, the end regex tells netstiff to ignore any content after that regex match. Having well-chosen values for the start and end regex makes netstiff report only when meaningful changes occur on the URL you are monitoring.
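
The effect of the start and end regex can be sketched in a few lines of Python. This is not netstiff's own code (netstiff is written in Ruby); the function name trim_between and the sample page are made up for illustration, and the sketch assumes that the matched text itself is kept at both ends.

```python
import re

def trim_between(html, start_regex=None, end_regex=None):
    """Keep the text from the first start match through the first end match.

    Assumption: the matched text is kept at both ends. netstiff's exact
    behavior at the boundaries may differ.
    """
    begin = 0
    if start_regex is not None:
        match = re.search(start_regex, html)
        if match:
            begin = match.start()
    finish = len(html)
    if end_regex is not None:
        # Only look for the end pattern after the start position.
        match = re.compile(end_regex).search(html, begin)
        if match:
            finish = match.end()
    return html[begin:finish]

# Made-up page with a volatile banner and footer around the interesting part.
page = (
    '<div id="banner">Today is Tuesday</div>\n'
    '<h1>Linux.com : Features</h1>\n'
    '<p>Feature text</p>\n'
    '<div class="divider"></div>\n'
    '<div id="footer">Random ad</div>\n'
)
trimmed = trim_between(page, r'Linux\.com : Features', r'div class="divider"')
print(trimmed)
```

Only the text between the two patterns survives, so edits to the banner or footer of the sample page would never show up in a comparison of the trimmed text.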

The configuration file for netstiff is designed to be human readable and editable. Netstiff's man page describes the format and options in detail. You can continue to use the configuration menu for netstiff to maintain your URLs by invoking netstiff -c. If you don't specify the -c option, netstiff will check the URLs in your configuration to see what has changed. Here, I set up netstiff to monitor my local linux.com snapshot.

$ ./netstiff
No such file or directory - /home/ben/.netstiff/config
netstiff 20080331 Copyright (C) 2004, 2007-2008 Stephan Beyer, GNU GPL

[0] Global settings
[n] Add new URI

[f] Finished (exit & save)
[x] Exit without saving

Your choice [n/f/x/0]: n

Please type the correct URI of the web page you want to check for updates:
> http://localhost/www.linux.com-snapshot/index.html

[u] URI: http://localhost/www.linux.com-snapshot/index.html
[m] Test method: diff
[x] Menu title: [not set]

[a] User-Agent: [not set]
[r] Referer: [not set]
[l] Accept-Language: [not set]
[b] Range: [not set]
[t] Timeout: [not set]
[p] Proxy: [not set]
[ ] Command HTML dumper: [only available w/ html]
[s] Start regexp: [not set]
[e] End regexp: [not set]

[d] Delete URI from list.
[f] I have finished.

Your choice [u/m/x/a/r/l/b/t/p/s/e/d/f]: s
> /Linux.com : Features/


Your choice [u/m/x/a/r/l/b/t/p/s/e/d/f]: e

Please type the regexp, e.g. /[Hh]ugo\s+wrote/ ...
> /div class="divider"/

Your choice [u/m/x/a/r/l/b/t/p/s/e/d/f]: f
[f] Finished (exit & save)
[x] Exit without saving

Your choice [n/f/x/0/1]: f

To make sure that netstiff has a copy of the newly configured Web site in its local cache you should run netstiff without any options. After doing this, to force a change, I edited the local snapshot to include a double capital X at the start of the first feature description, then ran netstiff a second time to see if it would detect the change. The program reported the change in unified diff format as shown below.

$ ./netstiff
diff --netstiff diff http://localhost/www.linux.com-snapshot/index.html
--- http://localhost/www.linux.com-snapshot/index.html
+++ http://localhost/www.linux.com-snapshot/index.html
@@ -531,7 +531,7 @@
By <a href="http://www.wethinkthebook.net/home.aspx">Charles Leadbeater</a> on April 02, 2008 (9:00:00 PM)
- <p>Linux has succeeded as a product only because the community that supports it ...
+ <p>XX Linux has succeeded as a product only because the community that supports it ...
<a href="http://www.linux.com/feature/130025">Read the Rest</a> -
<a href="http://www.linux.com/feature/130025#commentthis">Post Comment</a>
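
For readers curious how a report like this can be produced, here is a minimal sketch using Python's standard difflib module. It is an illustration only, not how netstiff (a Ruby program) generates its output; the function name report_change and the shortened sample text are assumptions of mine.

```python
import difflib

def report_change(url, old_text, new_text, context=3):
    """Return a unified diff between two versions of a page, or '' if equal."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines, new_lines, fromfile=url, tofile=url, n=context)
    return ''.join(diff)

# Shortened stand-ins for the cached copy and the freshly fetched copy.
old = ("By Charles Leadbeater\n"
       "<p>Linux has succeeded as a product</p>\n"
       "Read the Rest\n")
new = ("By Charles Leadbeater\n"
       "<p>XX Linux has succeeded as a product</p>\n"
       "Read the Rest\n")

diff_text = report_change(
    'http://localhost/www.linux.com-snapshot/index.html', old, new)
print(diff_text)
```

The context parameter controls how many unchanged lines surround each change, which is what makes the altered paragraph easy to locate in the page.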

To test the start regex, I injected some extra text into one of the NewsVac items and reran netstiff, expecting it to completely ignore this change. Unfortunately, netstiff reported the change that it should have ignored because it was before the start regex. It turned out that the start and end regex that I added with the netstiff menu were not written to the ~/.netstiff/config configuration file -- apparently a bug. Once I added the options manually, netstiff faithfully ignored modifications to the NewsVac entries.

The configuration file is shown below. Options are specified with a plus sign as a prefix. An option is applied to the previous URL specified -- in this case, my local linux.com snapshot URL.

$ cat ~/.netstiff/config
+ start /Linux.com : Features/
+ end /div class="divider"/


Urlwatch performs the same basic task as netstiff. Urlwatch is written in Python and uses a small Python hook function in place of the start and end regex that netstiff uses. Python fans might find urlwatch easier to extend and customize than netstiff.

You configure the URLs to watch by placing one URL per line into the file urls.txt. When you run watch.py, newly added URLs are reported to the console. In the session below, the first run reports the new URL and the second prints nothing because the page has not changed. Before the third run I modified the linux.com snapshot to include the "XX" change again.

$ ./watch.py
NEW: http://localhost/www.linux.com-snapshot/index.html

$ ./watch.py
$ ./watch.py
CHANGED: http://localhost/www.linux.com-snapshot/index.html
@@ -531,7 +531,7 @@
By <a href="http://www.wethinkthebook.net/home.aspx">Charles Leadbeater</a> on April 02, 2008 (9:00:00 PM)
- <p>Linux has succeeded as a product only because the community...
+ <p>XX Linux has succeeded as a product only because the community...
<a href="http://www.linux.com/feature/130025">Read the Rest</a> -
<a href="http://www.linux.com/feature/130025#commentthis">Post Comment</a>
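
The NEW, silent, and CHANGED behavior in the session above comes down to comparing each fetched page against a cached copy. The sketch below shows one way to do that; it is not urlwatch's actual implementation. The function name check_url, the SHA-1 based cache file name, and the temporary cache directory are all assumptions made for the example, and page content is passed in directly so that no network access is needed.

```python
import hashlib
import os
import tempfile

def check_url(url, content, cache_dir):
    """Return 'NEW', 'CHANGED', or None, updating the cached copy as needed."""
    # Assumption: one cache file per URL, named after a hash of the URL.
    name = hashlib.sha1(url.encode('utf-8')).hexdigest()
    path = os.path.join(cache_dir, name)
    status = None
    if not os.path.exists(path):
        status = 'NEW'
    else:
        with open(path, encoding='utf-8') as cached:
            if cached.read() != content:
                status = 'CHANGED'
    if status is not None:
        with open(path, 'w', encoding='utf-8') as cached:
            cached.write(content)
    return status

cache = tempfile.mkdtemp()
url = 'http://localhost/www.linux.com-snapshot/index.html'
results = [
    check_url(url, '<p>Linux has succeeded</p>', cache),     # first sighting
    check_url(url, '<p>Linux has succeeded</p>', cache),     # unchanged
    check_url(url, '<p>XX Linux has succeeded</p>', cache),  # edited page
]
print(results)
```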


Urlwatch includes a hook function, defined in hooks.py, to let you remove portions of HTML that you are not interested in tracking. After urlwatch fetches data from a URL, it passes both the URL and the fetched data to the hook function to allow you to trim out content that is not of interest. You can do whatever you like in this hook function and must return the data that you want urlwatch to compare. For example, you might like to strip out any headers or footers that might contain today's date, or other volatile data, so that urlwatch does not keep telling you that these uninteresting things have changed. By default, hooks.py returns the data that is passed into the function unmodified.
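
As a small illustration of such a hook, the sketch below strips date stamps from one page before comparison. The filter(url, data) signature follows the hooks.py convention described above; the example.com URL and the date pattern are hypothetical, and the code is written for a current Python 3 interpreter rather than the Python 2 that urlwatch targeted at the time.

```python
import re

# Hypothetical pattern for date stamps such as "April 29, 2008".
DATE_STAMP = re.compile(
    r'(January|February|March|April|May|June|July|August|'
    r'September|October|November|December) \d{1,2}, \d{4}')

def filter(url, data):
    """Return the part of the page that should be compared between runs.

    The name shadows Python's built-in filter(), as the hook convention
    requires.
    """
    if url == 'http://example.com/news.html':  # hypothetical URL
        # Drop volatile date stamps so they never trigger a report.
        return DATE_STAMP.sub('', data)
    # Every other URL is compared unmodified.
    return data

cleaned = filter('http://example.com/news.html',
                 '<p>Updated April 29, 2008</p>')
print(cleaned)
```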

The hooks.py examples that the developers provide use the Python re.sub function to modify the passed-in data, to give an idea of what transformations you might like to perform. Because the . in a regular expression does not match a newline by default, a pattern such as .* cannot span more than one line, so I call re.sub three times below to get the same effect as the start regex of netstiff. Because the HTML from linux.com has newlines that I might like to keep, I first change the existing newlines to HTML comments, then remove anything that precedes the features that I am interested in, and then change the HTML comments that I inserted back into real newlines. Keeping the newlines intact is handy when urlwatch gives you the diff showing what has changed on the Web site: with the newlines in place, the diff is easier to read without using word wrap.

Note the ^ character at the start of the middle re.sub call. Without that anchor to the start of the data, the re.sub call is much more expensive to evaluate, going from almost instant to multiple seconds, because the match is attempted again from each character position in the HTML. With the changes to hooks.py shown below, I can again change the NewsVac entries and not see those changes reported by urlwatch.

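$ cat hooks.py
import re

def filter(url, data):
    if url == 'http://www.inso.tuwien.ac.at/lectures/usability/':
        # Blank out any line that contains the TYPO3SEARCH_end marker.
        return re.sub('.*TYPO3SEARCH_end.*', '', data)
    elif url == 'http://localhost/www.linux.com-snapshot/index.html':
        print('match!')  # debug output showing that the hook fired
        # Hide the newlines so that .* can span the whole page.
        data = re.sub('\n', '<!--NEWLINE-->', data)
        # Throw away everything before the features section.
        data = re.sub('^.*Linux.com : Features', 'Linux.com : Features', data)
        # Restore the newlines so the diff stays readable.
        return re.sub('<!--NEWLINE-->', '\n', data)
    # All other URLs are compared unmodified.
    return data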
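
As an aside, the newline juggling can be avoided by passing the re.DOTALL flag, which lets . match newlines as well. The sketch below compares the two approaches on a made-up sample. The helper names trim_three_step and trim_dotall are mine, and the flags keyword of re.sub assumes a Python version newer than the one current when urlwatch was written.

```python
import re

MARKER = 'Linux.com : Features'

def trim_three_step(data):
    """The approach from hooks.py: hide newlines, trim, restore newlines."""
    data = re.sub('\n', '<!--NEWLINE-->', data)
    data = re.sub('^.*' + re.escape(MARKER), MARKER, data)
    return re.sub('<!--NEWLINE-->', '\n', data)

def trim_dotall(data):
    """Same trimming in one call: with DOTALL, . also matches newlines."""
    return re.sub('^.*' + re.escape(MARKER), MARKER, data, flags=re.DOTALL)

# Made-up stand-in for the page: two NewsVac lines before the features.
sample = ('NewsVac item one\n'
          'NewsVac item two\n'
          'Linux.com : Features\n'
          'First feature\n')

print(trim_dotall(sample))
```

Both functions keep the ^ anchor, so the pattern is only tried from the start of the data.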

Tools like netstiff and urlwatch can be very useful to help you keep an eye on Web sites and FTP sites. One good use case is for software that's not released into package repositories but simply copied to an FTP server when a new release is made. With netstiff you can monitor these FTP sites and be notified automatically when a new release is made. Being able to see the differences only when the interesting part of a Web page changes lets you avoid being notified of meaningless changes to banners or JavaScript.

