June 6, 2007

A simple script for tracking Web sites

Author: Leslie P. Polzer

Many Web sites feature RSS feeds and newsletter subscriptions that let you know when they've updated their contents, but unfortunately, a significant number of sites still don't. How can you keep up with them? Let's build a shell script to automate that task.

First, since solving problems is easier when you don't have to do it yourself, let's find out whether somebody has already handled this problem. A few Google searches later, it's evident that the few available tools are all for Microsoft Windows, and like most programs for Windows, they are not free of charge and limit your freedom.

For Linux, there's the GPL-licensed WebMonX, but it's a GUI tool that requires lots of clicking and notifies you with popups and sounds. If that's your thing, fine -- you have found a ready-made solution that suits your needs. If not, let's try writing a simple script that meets some KISS criteria:

  • Unobtrusive: popping up a message on every change is a no-no. An email message should do the job nicely.
  • Small: only a few lines of code.
  • Modular: should rely on widely available and well-tested components.
  • Smart: when changes are detected, we want a diff of them.

First, we need a text browser -- for example, w3m -- to get the pages in rendered form. Just grabbing the raw HTML or the HTTP response would do, of course, but it's not nice to look at. Second, we'll use a hash program like md5sum or sha1sum -- both of which can be found in the GNU Coreutils package -- to generate a name for the file where we store a snapshot of the page. Then we need a working diff and, finally, an implementation of the mail command, which should be provided by your local MTA. We can also use some basic utilities that should be installed on every system, such as wc and touch.
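The file-naming idea can be tried on its own before wiring everything together. The following is a minimal sketch, assuming GNU md5sum is on the PATH; the function name snapshot_name and the example URL are made up for illustration.

```shell
#!/bin/sh
# snapshot_name URL  (hypothetical helper, not part of the original script)
# Maps a URL to a fixed-length file name by hashing it, so characters
# such as "/" and "?" never end up in the name.
snapshot_name() {
    # md5sum prints "<hash>  -" for standard input; keep only the hash
    sum=$(printf '%s\n' "$1" | md5sum | cut -d' ' -f1)
    printf '%s.txt\n' "$sum"
}

# Example: the same URL always yields the same file name.
snapshot_name "http://example.com/news"
```

Hashing the URL rather than using it directly keeps the snapshot names short and safe for any filesystem, at the cost of names that are not human-readable.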

When everything is in place, we can use the following shell script to do our tracking task. It scans the file list.txt, reading one URL from each line. We get a current version of the URL's contents and compare it with the saved version, then send changes, if there are any, to the email address specified in the RECIP variable.

#!/bin/sh
# webtrack.sh

RECIP=user@host      # where notifications get sent
DUMPCMD="w3m -dump"  # text browser invocation

for url in $(cat list.txt); do

    # name the snapshot file after the MD5 hash of the URL
    md5=$(echo "$url" | md5sum | cut -d\  -f 1)

    # make sure a snapshot exists, even on the first run
    touch "$md5.txt"

    # fetch the current rendered version of the page
    $DUMPCMD "$url" > tmp.txt

    if diff "$md5.txt" tmp.txt >/dev/null; then
        : #echo no changes
    else
        : #echo "changes: "
        diff -Napu "$md5.txt" tmp.txt > diff.txt
        mv tmp.txt "$md5.txt"
        mail -s "Changes in $url found." "$RECIP" <<eof
The diff has $(wc -l < diff.txt) lines.

Changes are below.

$(cat diff.txt)
eof
    fi
done


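The compare-and-update step at the heart of the loop can be tried without a network connection or a mail setup. The sketch below is an illustration under assumptions: check_snapshot is a hypothetical helper name, it prints the diff instead of mailing it, and it works on two local files rather than on a page dump.

```shell
#!/bin/sh
# check_snapshot OLD NEW  (hypothetical helper for illustration only)
# Prints a unified diff and replaces OLD with NEW when the files differ.
# Returns 0 if a change was found, 1 if the files are identical.
check_snapshot() {
    old=$1
    new=$2
    touch "$old"                      # first run: compare against an empty file
    if cmp -s "$old" "$new"; then
        rm -f "$new"                  # nothing changed, discard the fresh dump
        return 1
    fi
    diff -u "$old" "$new" || true     # diff exits 1 when files differ; expected here
    mv "$new" "$old"                  # the fresh dump becomes the new snapshot
    return 0
}

# Demo with throwaway files instead of real page dumps.
dir=$(mktemp -d)
printf 'old headline\n' > "$dir/snap.txt"
printf 'new headline\n' > "$dir/tmp.txt"
check_snapshot "$dir/snap.txt" "$dir/tmp.txt"
rm -rf "$dir"
```

Because a missing snapshot is created empty first, the very first check of a site reports the whole page as added lines.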

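Now just populate list.txt with one URL per line, make the script executable (chmod 755 webtrack.sh), and set up a cron job for it with an entry like this in your crontab file: 0 8 * * * /path/to/webtrack.sh. This will check the sites in list.txt every morning at 8 a.m. Check the crontab(1) man page if you are not sure what to do with this line. Note that the script refers to list.txt and its snapshot files by relative path, so keep them in the directory cron starts the job in, which is typically your home directory.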
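Because list.txt is edited by hand, it may help to be a little forgiving when reading it. The following sketch is one possible approach, not what webtrack.sh itself does: read_urls is a hypothetical helper that reads the list line by line and skips blank lines and lines starting with #. The comment convention is an assumption added here for illustration.

```shell
#!/bin/sh
# read_urls FILE  (hypothetical helper; the "#" comment convention is an
# assumption of this sketch, not something the list format defines)
read_urls() {
    # "|| [ -n "$line" ]" also picks up a last line without a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            ''|'#'*) continue ;;   # skip empty lines and comments
        esac
        printf '%s\n' "$line"
    done < "$1"
}

# Demo with a throwaway list file.
list=$(mktemp)
printf '%s\n' '# news sites' 'http://example.com/news' '' 'http://example.org/' > "$list"
read_urls "$list"
rm -f "$list"
```

The demo prints only the two URLs. In the tracking script, the output of such a helper could feed the loop in place of $(cat list.txt).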

It's also nice to have a script that appends a new URL to list.txt. For local lists, we can just use echo directly to append the URL. For a remote list, we execute echo remotely via ssh.

# ww-add.sh

# if the list is local
echo "$1" >> /path/to/list.txt

# if the list is remote
ssh user@host "echo '$1' >> /path/to/list.txt"
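
Appending blindly means the same URL can end up in the list twice, and the site would then be fetched twice per run. A small guard avoids that. This is a sketch under assumptions: add_url is a hypothetical helper name, and the list path is passed as an argument rather than hard-coded.

```shell
#!/bin/sh
# add_url LIST URL  (hypothetical helper for illustration only)
# Appends URL to LIST unless an identical line is already present.
add_url() {
    list=$1
    url=$2
    touch "$list"
    # -F: fixed string, -x: match the whole line, -q: no output
    if grep -Fxq -- "$url" "$list"; then
        echo "already tracked: $url" >&2
        return 1
    fi
    printf '%s\n' "$url" >> "$list"
}

# Demo with a throwaway list file.
list_file=$(mktemp)
add_url "$list_file" "http://example.com/news"
add_url "$list_file" "http://example.com/news" || true   # second call adds nothing
cat "$list_file"
rm -f "$list_file"
```

Matching the whole line with -x matters here: without it, a URL that is merely a prefix of an existing entry would be treated as a duplicate.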

Happy tracking!

This little exercise shows that a few lines of shell script can make life easier and save hours compared to doing things manually over and over.

Leslie P. Polzer is a free software consultant and writer who has plenty of experience in leaving chores to the computer.
