September 12, 2005

CLI Magic: the word on wget

Author: Joe Barr

OK, you laggardly louts late to the Linux party, listen up! This week's column is all about power to the people. Command line power. Power that keeps working while you're off lollygagging. We're talking about GNU Wget: the behind-the-scenes, under-the-hood, don't-need-watching network utility that speaks HTTP, HTTPS and FTP with equal fluency. Wget makes it easy to download a personal copy of a Web site from the Internet to peruse offline at your leisure, or retrieve the complete contents of a distribution directory on a remote FTP site.

The basic format for the wget command is as follows:


wget -options protocol://url

Let's save the options for later and begin by looking at the protocol://url combination. As noted above, Wget groks HTTP, HTTPS and FTP. Indicate how you want to talk to the remote site by specifying one of: http, https, or ftp. Like this:


wget -options ftp://url

As for the site, let's try to get a complete copy of the current version of Slackware. It will be difficult because there is limited bandwidth available and the connections are rationed. Filling out the URL, our command looks like this before selecting the options:


wget -options ftp://carroll.cac.psu.edu/pub/linux/distributions/slackware/slackware-current/

Now about those options. We'll only need two to get the job done: -c and -r. You can combine those into a single option so the complete command looks like this:


wget -cr ftp://carroll.cac.psu.edu/pub/linux/distributions/slackware/slackware-current/

The -c option tells wget to continue a download begun by an earlier wget or ftp session, picking up partially retrieved files where they left off. This allows you to recover from network interruptions or outages without starting from byte zero. The -r option tells wget that this is a recursive request and that it should retrieve everything in and below the target URL.
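Bundling the flags is only a typing shortcut. As a minimal sketch, the three spellings below should all mean the same thing to wget; --continue and --recursive are the long names for -c and -r. The commands are echoed rather than run, so you can try this without pestering the server, and the variable names are mine, purely for illustration.

```shell
# Dry run: each command line is printed, not executed.
url="ftp://carroll.cac.psu.edu/pub/linux/distributions/slackware/slackware-current/"

combined="wget -cr $url"                        # flags bundled, as in the column
separate="wget -c -r $url"                      # same flags, given one at a time
long_form="wget --continue --recursive $url"    # same flags, long names

printf '%s\n' "$combined" "$separate" "$long_form"
```

Paste any one of the printed lines at the prompt to run it for real.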

As it happens, I was able to get a connection to the FTP server, but lost it before the entire contents of the directory had been retrieved. After trying 20 times to reconnect, wget threw up its hands in despair and quit, informing me that 1,000 files and 422 million bytes of data had been transferred. I suspect -- given the round number of files -- that the server cut the connection because of a daily quota rather than because of anything in my choice of options.

In any case, there is another option, the -t number option, to specify the number of times to try to reconnect. The default is 20, but you can set it to be any number you like. If you specify -t 0, wget will try an infinite number of times.
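Here is a rough sketch of how -t slots into the earlier command. The helper name build_wget_cmd is made up for this example, and it only prints the command line rather than running it, so nothing touches the network until you paste the output yourself.

```shell
# build_wget_cmd is a hypothetical helper for illustration only.
# $1 = number of tries (0 means keep trying without limit), $2 = URL.
build_wget_cmd() {
    echo "wget -cr -t $1 $2"
}

url="ftp://carroll.cac.psu.edu/pub/linux/distributions/slackware/slackware-current/"

build_wget_cmd 0 "$url"     # retry forever
build_wget_cmd 50 "$url"    # give up after 50 tries
```

Combined with -c, an unlimited retry count lets wget keep picking up where it left off each time the connection drops, for as long as the server is willing to talk to you.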

Wget a website

You can also use wget to create a local, browsable version of a Web site. Note that this method does not work on all sites, but works perfectly well on sites which rely on plain HTML to publish content. It doesn't work well, for example, on sites like Linux.com. But for sites like The Dweebspeak Primer, it's great.

We'll replace the ftp protocol in the command line with http, and add a couple of new options in order to create a local, browsable version of the site. The -E option (case is important) tells wget to add an .html extension to each HTML page it downloads that lacks one, such as a page generated by a CGI or one with an .asp extension, so that it is viewable locally. You may also want to add the -k and -K options. The -k option ensures that links are converted for local viewing. The -K option, used alongside -k, keeps a copy of each file as it was before its links were converted, saved with a ".orig" suffix.

Here is what I used to duplicate my site:


wget -rEKk http://www.pjprimer.com
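If the letter soup in -rEKk is hard to read, the same command can be spelled with one flag at a time. This is a sketch of that spelling only, echoed instead of executed; the variable names are illustrative and the comments are short summaries of each option.

```shell
# Dry run: the command line is printed, not executed.
site="http://www.pjprimer.com"

# -r  recurse through the links below the starting URL
# -E  save HTML pages with an .html extension
# -K  keep the pre-conversion original with a .orig suffix
# -k  convert links so they work when viewed locally
mirror_cmd="wget -r -E -K -k $site"

echo "$mirror_cmd"
```

By default, a recursive wget stores what it fetches in a directory named after the host, so the copy should turn up under www.pjprimer.com/ in the current directory.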

Conclusion

As always with CLI Magic, this is an introduction to a command line tool, not a complete tutorial. Get to know the man command and use it to learn more about wget and other useful command line jewels.
