September 7, 2010

Manage Linux Downloads with wget

Firefox, Chrome, and other browsers do an acceptable job of downloading a single file of reasonable size. But I don't like to trust a browser to grab ISO images and other files that are hundreds of megabytes, or larger. For that I prefer to turn to wget. You'll find that using wget provides some significant advantages over grabbing files with your browser.

First of all, there's the obvious — if your browser crashes or you need to restart for some reason, you don't lose the download. Firefox and Chrome have been fairly stable for me lately, but it's not unheard of for them to crash. That's a bit of a bummer if they're 75% of the way (or 98%) through downloading a 3.6GB ISO for the latest Fedora or openSUSE DVD.

It's also inconvenient when I want to download a file on a server. For example, if I'm setting up WordPress on a remote system I need to be able to get the tarball with the latest release on the server. It seems silly to copy it to my desktop and then use scp to upload it to the server. That's twice the time (at least). Instead, I use wget to grab the tarball while I'm SSH'ed into the server and save myself a few minutes.

Finally, wget is scriptable. If you want to scrape a Web site or download a file every day at a certain time, you can use wget as part of a script that you call from a cron job. Hard to do that with Firefox or Chrome.
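As a sketch of that, a crontab entry like the following would grab a file every night at 2 a.m. The URL and download directory are placeholders, of course:

```shell
# m h dom mon dow  command
# Fetch a (hypothetical) nightly ISO at 2:00 a.m. every day.
# -q keeps wget quiet so cron doesn't mail you pages of progress output;
# -P sets the directory to save into.
0 2 * * * /usr/bin/wget -q -P /home/user/downloads http://mirrorsite/nightly/latest.iso
```

Add it with crontab -e and cron takes care of the rest.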

Get Started with wget

Most Linux distributions should have wget installed, but if not, just search for the wget package. Several other packages use or reference wget, so you'll probably get several results — including a few front-ends for wget.

Let's start with something simple. You can download files over HTTP, FTP, and HTTPS with wget, so let's say you want to get the hot new Linux Mint Fluxbox edition. Just copy the URL to the ISO image and pass it to wget like so:

wget http://mirrorsite/pub/linuxmint/linuxmint-fluxbox.iso
Obviously, you'd replace "mirrorsite" with a legitimate site name, and the path to the ISO image with the correct path.

What about multiple files? Here's where wget really starts showing its advantages. Create a text file with the URLs to the files, one per line. For instance, if I wanted to copy the CD ISO images for Fedora 14 alpha, I'd copy the URLs for each install ISO to a text file like this:

http://mirrorsite/fedora/14-Alpha/Fedora-14-Alpha-i386-disc1.iso
http://mirrorsite/fedora/14-Alpha/Fedora-14-Alpha-i386-disc2.iso
http://mirrorsite/fedora/14-Alpha/Fedora-14-Alpha-i386-disc3.iso
You get the idea. Save the file as fedoraisos.txt or similar and then tell wget to download all of the ISO images:

wget -i fedoraisos.txt

Now wget will start grabbing the ISOs in order of appearance in the text file. That might take a while, depending on the speed of your Net connection, so what happens if the transfer is interrupted? No sweat. If wget is running, but the network goes down, it will continue trying to fetch the file and resume where it left off.
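To make the workflow concrete, here's a small sketch that builds the URL list from the shell (the Fedora paths are placeholders, following the "mirrorsite" convention above); the actual download line is left commented out so you can inspect the list first:

```shell
#!/bin/sh
# Write the ISO URLs to a file, one per line ("mirrorsite" is a placeholder).
cat > fedoraisos.txt <<'EOF'
http://mirrorsite/fedora/14-Alpha/Fedora-14-Alpha-i386-disc1.iso
http://mirrorsite/fedora/14-Alpha/Fedora-14-Alpha-i386-disc2.iso
http://mirrorsite/fedora/14-Alpha/Fedora-14-Alpha-i386-disc3.iso
EOF
echo "$(wc -l < fedoraisos.txt) URLs queued"
# wget -i fedoraisos.txt    # uncomment to start the downloads
```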

But what if the computer crashes or you need to stop wget for some other reason? The wget utility has a "continue" option (-c) that can be used to resume a download that's been interrupted. Just start the download using the -c option before the argument with the file name(s) like so:

wget -c -i fedoraisos.txt

If you restart a stopped download without the -c option, wget will start from scratch and save the new copy to a file with a .1 appended to the filename. This is wget trying to protect you from "clobbering" a previous file.

Mirroring and More

You can also use wget to mirror a site. Using the --mirror option, wget will actually try to suck down the entire site, and will follow links recursively to grab everything it thinks is necessary for the site.

Unless you own a site and are trying to make a backup, the --mirror option might be a bit aggressive. If you're trying to download a page for archival purposes, the -p option (page requisites) might be better. When wget is finished, it will create a directory with the site name (so if you tried wget --mirror http://www.example.com, it'd create a directory called www.example.com) and put all of the requisite files underneath. Odds are when you open the site in a browser it won't look quite right, but it's a good way to get the content of a site.
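If the saved page looks broken in a browser, wget's --convert-links option (-k) can help: combined with -p, it rewrites the links in the downloaded HTML to point at your local copies. The URL here is just a placeholder:

```shell
# -p grabs the page plus its requisites (images, CSS, and so on);
# -k rewrites links in the saved HTML so the page works offline.
wget -p -k http://www.example.com/somepage.html
```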

Password-protected sites are not a problem, as wget supports several options for passing the username and password to a site. Just use the --user and --password options, like so: wget --user=username --password=password http://mirrorsite/path/to/file.iso, where the username and password are replaced with your credentials. Be careful with this on a shared system, though: other users can see the full command line, credentials and all, via top, ps, or similar tools.
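One way around exposing credentials on the command line (an alternative the article doesn't cover) is to keep them in your ~/.wgetrc file, which wget reads at startup. The user and password settings below stand in for your real credentials:

```shell
# ~/.wgetrc -- wget reads this file at startup.
# Keep it private so other users can't read your credentials:
#   chmod 600 ~/.wgetrc
user = username
password = password
```

With this in place, a plain wget URL picks up the credentials automatically, and nothing sensitive appears in the process list.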

Sometimes a site will deny access to non-browser user agents. If this is a problem, wget can fake the user agent string with --user-agent=agent-string.
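For example, to masquerade as a desktop browser (the agent string below is illustrative, not a recommendation of any particular one):

```shell
# Present a browser-like user agent instead of wget's default.
wget --user-agent="Mozilla/5.0 (X11; Linux x86_64)" http://mirrorsite/path/to/file.iso
```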

If you don't have the fastest connection in the world, you might want to throttle wget a bit so it doesn't consume your available bandwidth or hammer a remote site if you are on a fast connection. To do that, you can use the --limit-rate option, like this:

wget --limit-rate=2m http://mirrorsite/path/to/file.iso

That will tell wget to cap its download speed at 2 megabytes per second, though you can also use k to specify kilobytes.

If you're grabbing a bunch of files, the -w (wait) option can pause wget between the files. So wget -w 1m -i fedoraisos.txt will pause wget one minute between downloads.
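Putting the politeness options together, a batch download that goes easy on a small mirror might look like this (the list file follows the earlier example):

```shell
# Cap the speed at 500 KB/s, wait 30 seconds between files,
# and resume any partial downloads from a previous run.
wget --limit-rate=500k -w 30 -c -i fedoraisos.txt
```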

There's a lot more to wget, so be sure to check the man page to see all the options. In a future tutorial, we'll cover using wget for more complex tasks and examining HTTP responses from Apache.