Linux.com

Feature

CLI Magic: the word on wget

By Joe Barr on September 12, 2005 (8:00:00 AM)


OK, you laggardly louts late to the Linux party, listen up! This week's column is all about power to the people. Command line power. Power that keeps working while you're off lollygagging. We're talking about GNU Wget: the behind-the-scenes, under-the-hood, don't-need-watching, network utility that speaks HTTP, HTTPS and FTP with equal fluency. Wget makes it easy to download a personal copy of a Web site from the Internet to peruse offline at your leisure, or retrieve the complete contents of a distribution directory on a remote FTP site.
The basic format for the wget command is as follows:

wget -options protocol://url

Let's save the options for later and begin by looking at the protocol://url combination. As noted above, Wget groks HTTP, HTTPS and FTP. Indicate how you want to talk to the remote site by specifying one of: http, https, or ftp. Like this:

wget -options ftp://url

As for the site, let's try to get a complete copy of the current version of Slackware. It will be difficult because there is limited bandwidth available and the connections are rationed. Filling out the URL, our command looks like this before selecting the options:

wget -options ftp://carroll.cac.psu.edu/pub/linux/distributions/slackware/slackware-current/

Now about those options. We'll only need two to get the job done: -c and -r. You can combine those into a single option so the complete command looks like this:

wget -cr ftp://carroll.cac.psu.edu/pub/linux/distributions/slackware/slackware-current/

The -c option tells wget to continue a previously executed wget or ftp session. This allows you to recover from network interruptions or outages without starting from byte zero. The -r option tells wget that this is a recursive request and that it should retrieve everything in and below the target URL.

As it happens, I was able to get a connection to the ftp server, but lost it before the entire contents of the directory had been retrieved. After trying 20 times to reconnect, wget threw up its hands in despair and quit, informing me that 1,000 files and 422 million bytes of data had been transferred. I suspect -- due to the round number of files -- that the connection may have been terminated due to a daily quota by the server rather than the number of retries.

In any case, there is another option, the -t number option, to specify the number of times to try to reconnect. The default is 20, but you can set it to be any number you like. If you specify -t 0, wget will try an infinite number of times.
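Pulling those three options together, here is a minimal sketch of the same Slackware download spelled out with wget's long option names, which are easier to read back in a script. It assumes bash (for the array), and it only prints the command rather than running it, so nothing is fetched until you decide to.

```shell
#!/usr/bin/env bash
# Same download as above, spelled with wget's long option names.
url="ftp://carroll.cac.psu.edu/pub/linux/distributions/slackware/slackware-current/"

cmd=(
  wget
  --continue    # -c: resume partially downloaded files
  --recursive   # -r: fetch everything in and below the URL
  --tries=0     # -t 0: retry without limit (the default is 20)
  "$url"
)

# Print the command for review; run it with "${cmd[@]}" when it looks right.
printf '%s\n' "${cmd[*]}"
```

Keeping the command in an array means the URL stays a single argument even if it later picks up characters the shell would otherwise interpret.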

Wget a website

You can also use wget to create a local, browsable version of a Web site. Note that this method does not work on all sites, but works perfectly well on sites which rely on plain HTML to publish content. It doesn't work well, for example, on sites like Linux.com. But for sites like The Dweebspeak Primer, it's great.

We'll replace the ftp protocol in the command line with http, and add a couple of new options in order to create a local, browsable version of the site. The -E option (case is important) tells wget to add an .html extension to each page it downloads that may have been generated by a CGI or which has an .asp extension so that it is viewable locally. You may also want to add the -k and -K options. The -k option ensures that links are converted for local viewing. The -K option backs up the original version of a file with a ".orig" suffix, so that different stories that are generated with the same page name are not overwritten.

Here is what I used to duplicate my site:

wget -rEKk http://www.pjprimer.com
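For anyone who finds -rEKk a bit cryptic, the sketch below lists the same four options one per line with a note on each. It assumes bash, and it keeps -E in its short form because, as far as I know, the long name for that option has not been the same in every wget release. As written it only prints the command.

```shell
#!/usr/bin/env bash
# The site-copy command from above, one option per line.
site="http://www.pjprimer.com"

cmd=(
  wget
  --recursive         # -r: follow links below the start page
  -E                  # add .html to pages that lack it
  --backup-converted  # -K: keep the original as file.orig
  --convert-links     # -k: rewrite links for local viewing
  "$site"
)

# Print the command for review; run it with "${cmd[@]}".
printf '%s\n' "${cmd[*]}"
```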

Conclusion

As always with CLI Magic, this is an introduction to a command line tool, not a complete tutorial. Get to know the man command and use it to learn more about wget and other useful command line jewels.


Comments

on CLI Magic: the word on wget

Note: Comments are owned by the poster. We are not responsible for their content.

isn't wget an inactive project?

Posted by: Anonymous Coward on September 12, 2005 09:37 PM
I thought wget was an inactive project and people were using curl now.

#

Re:isn't wget an inactive project?

Posted by: Anonymous Coward on September 13, 2005 12:17 AM
You thought wrong: http://www.gnu.org/software/wget/wget.html

#

Tips

Posted by: Anonymous Coward on September 13, 2005 05:23 AM
This command may have unexpected results:

wget -cr ftp://carroll.cac.psu.edu/pub/linux/distributions/slackware/slackware-current/

wget may follow the link to the parent directory and end up mirroring the whole FTP. From the wget man page:


  -np

  --no-parent
Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.

Next, another helpful option: specify exactly which filetypes you want. The -A option, the accept list, takes a comma-separated list of extensions that wget should retrieve from the target. It might not be useful with that particular FTP site, but it is with FTP sites that hold mixed filetypes. If you wanted to get all the music files from a directory on an FTP site, you would do this:

wget -cr -np -A mp3,ogg,flac,mpc ftp://somesite/somedirectory/

#

Re:Tips

Posted by: Anonymous Coward on September 14, 2005 05:02 AM
Shouldn't that be impossible with ftp? FTP doesn't have a concept of links, only directory structure.

However, if the download site is http-based, I completely agree. I've gotten burned by that before, so thanks for bringing "-np" to my attention.

#

Re:Tips

Posted by: Anonymous Coward on September 14, 2005 05:48 AM
I think you are right. I had the problem with HTTP sites, but you shouldn't need -np for FTP.

#

Re:Tips

Posted by: Anonymous Coward on September 18, 2005 06:56 PM
The FTP protocol may not provide any commands to manipulate symlinks but they can exist on the underlying filesystem and some FTP clients can obviously see them.

For example, I created a symlink on my private FTP account. The server is proftpd.

Here is a sample session with ncftp:

# ncftp ftp://localhost:1234
ncftp / < ls
here@ incoming/ README.txt
ncftp / < ls -la
drwxr-xr-x 3 ftp ftp 4096 Sep 18 10:49 .
drwxr-xr-x 3 ftp ftp 4096 Sep 18 10:49 ..
-rw-r--r-- 1 ftp ftp 607 Sep 12 21:26 README.txt
lrwxrwxrwx 1 ftp ftp 1 Sep 18 10:49 here -< .
drwxr-xr-x 2 ftp ftp 4096 Sep 13 18:51 incoming


       

#

Re:Tips

Posted by: Anonymous Coward on September 18, 2005 06:57 PM
Oops! All < should of course be > in the previous message.

#

Wget is good

Posted by: Anonymous Coward on September 13, 2005 10:36 PM
You can get a file from a website by:
# wget http://www.example.com/funky_drums.mp3

For list of command line options to use:
# wget --help

For manual page (manpage)
# man wget

It is also good for mirroring websites (such as defaces :p). There is a --mirror option for that, IIRC.

#

scripting around wget

Posted by: Anonymous Coward on September 18, 2005 06:23 PM
The recursive options are nice but will not work if the web site does not explicitly link to the files you want to download.

For example, you are browsing www.dummy.net and notice 2 interesting images img/pict013.jpg and img/pict134.jpg.
More images of the form img/pict???.jpg may be available in that directory. Most sites will not allow you to get directory listings but you can scan using the following shell script:

A=1
# -le rather than -lt, so that pict999.jpg is tried as well
while [ $A -le 999 ] ; do
    N=`printf %03d $A`    # zero-pad to three digits: 001, 002, ...
    wget http://www.dummy.net/img/pict$N.jpg
    let A=A+1
done

or if you are not afraid of typing a single long shell command:

A=1; while [ $A -le 999 ] ; do N=`printf %03d $A` ; wget http://www.dummy.net/img/pict$N.jpg ; let A=A+1 ; done

NOTE: the [dummy.net] that the site inserts automatically after the URL is not part of the command.

#

shame on me

Posted by: Anonymous Coward on September 18, 2005 06:39 PM
I just discovered that the 'for' command (in bash at least) has a variant that allows C like iterations.

I have been using the 'while' syntax in bash for about 10 years. Better late than never :-)

for ((i=1;i<1000;i++)) ; do
    N=`printf %03d $i`
    wget http://www.dummy.net/img/pict$N.jpg
done

#

Re:shame on me

Posted by: Anonymous Coward on September 25, 2005 10:47 PM
for i in `seq -w 1 999`; do wget http://www.dummy.net/img/pict$i.jpg; done
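Another way to write the same scan, offered as a sketch rather than a recommendation: bash can generate the zero-padded numbers itself with brace expansion, so no loop or seq is needed. This assumes bash 4 or later (older versions drop the leading zeros), and www.dummy.net is the made-up host from the comments above. The wget call is left commented out so the block only prints what it would request.

```shell
#!/usr/bin/env bash
# Requires bash 4 or later for zero-padded ranges such as {001..999}.
urls=( http://www.dummy.net/img/pict{001..999}.jpg )

echo "${#urls[@]} candidate URLs, from ${urls[0]} to ${urls[${#urls[@]}-1]}"

# To actually fetch them (missing pictures simply fail with an error):
# wget "${urls[@]}"
```

Passing all the URLs to a single wget process also lets it reuse its connection to the server instead of starting from scratch for every picture.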

#

encoded url

Posted by: Anonymous Coward on December 10, 2005 04:23 AM
I'm still using wget, even though there are so many other downloaders now, like axel, prozilla, mcurl, etc.
The only thing I hate about wget is that the output file name is encoded.

ex:
www.foo.com/this%20is%20a%20file.zip

I think it would be better if the output file were named "this is a file.zip".
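If your wget does leave the %20 escapes in the saved name, one possible workaround is to decode the name yourself and hand it to wget with -O. The sketch below assumes bash; urldecode is a helper name made up for this example, not something wget provides, and the wget call is commented out so the block only prints the decoded name.

```shell
#!/usr/bin/env bash
# urldecode: turn %XX escapes back into the characters they stand for.
# Caveat: a literal backslash in the input would also be interpreted.
urldecode() {
  printf '%b' "${1//%/\\x}"
}

url="www.foo.com/this%20is%20a%20file.zip"
name=$(urldecode "${url##*/}")   # keep only the part after the last slash
echo "$name"

# Hypothetical usage: save the download under the decoded name.
# wget -O "$name" "$url"
```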

#

Re:"&" character a problem in wget file names?

Posted by: Anonymous Coward on March 15, 2006 08:38 AM
grover, you need to escape the ampersands on the command line for a wget command; otherwise the shell treats each & as a command separator and cuts the URL short. To do this, either put the URL in quotes:


    wget "http://foo.com?bar=baz&quix=hello&something"

or escape them with backslashes:


    wget http://foo.com?bar=baz\&quix=hello\&something
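For a cron job it may be simplest to combine the quoting with an explicit output name, so the saved file is not named after the query string. The sketch below assumes bash; chart.gif is a made-up file name, and the command is only assembled and inspected here, not run.

```shell
#!/usr/bin/env bash
# Single quotes keep the shell from acting on the & characters.
url='http://foo.com?bar=baz&quix=hello&something'

# -O names the output file (chart.gif is a hypothetical name).
cmd=(wget -O chart.gif "$url")

# The whole URL, ampersands included, arrives as one argument.
printf '%s\n' "${#cmd[@]} arguments; URL: ${cmd[3]}"

# Run it for real with: "${cmd[@]}"
```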

#

"&" character a problem in wget file names?

Posted by: Administrator on January 25, 2006 08:47 AM
I'm new to command line executions, and I'm having a problem with wget.

I set up one cron job to get/save an xml file, and it works fine. I set up another one to get/save an image file from the same server, but it doesn't download the file properly. The problem seems to have something to do with ampersands (&) in the file name. The file name looks like this:

http://pointer.site.com/chart.aspx?provider=CSV&qualifier1&qualifier2&qualifier3&qualifier4&qualifier5&qualifier6&qualifier7&qualifier8

The file is recognized as a gif image and gets saved on my system as "chart.aspx?provider=CSV" but it's corrupted and doesn't view properly.

I tried to work around the problem by adding "--restrict-file-names=nocontrol" to the wget command, but it doesn't change anything.

Any/all suggestions will be appreciated.

#

This story has been archived. Comments can no longer be posted.



 