Saturday, September 17, 2011

wget tutorial: how to download files and web pages, and mirror websites

Downloading files with wget

Wget is a free utility for non-interactive download of files from the Web. It supports the HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies.

Say you want to download a URL such as www.softpedia.com. Just type:

wget http://www.softpedia.com

If you want to download a file from Softpedia, say Avira antivirus, just type wget followed by the link location of the file, and hit Enter, e.g.

wget http://www.softpedia.com/avira_antivirus_personal.exe

But what if the connection is slow and the file is large? The connection will probably fail more than once before the whole file is retrieved. In this case, Wget will keep trying to get the file until it either retrieves all of it or exceeds the default number of retries (20). It is easy to change the number of tries to 50, to ensure that the whole file arrives safely:

wget --tries=50 http://www.softpedia.com/avira_antivirus_personal.exe
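
The retry behaviour described above can be wrapped in a small shell function that also reports whether wget succeeded. This is only a sketch: the function name download_with_retries is made up for illustration, and adding --continue (so that a partial file left by an earlier run is resumed rather than fetched again) is an assumption about what you want, not something the command above does.

```shell
#!/bin/sh
# Hypothetical helper: fetch one URL with a chosen retry limit.
# Usage: download_with_retries URL [TRIES]
download_with_retries() {
    url=$1
    tries=${2:-20}   # fall back to 20, wget's default number of tries
    # --continue resumes a partial file left by an earlier run
    if wget --tries="$tries" --continue "$url"; then
        echo "download finished: $url"
    else
        status=$?    # exit status of the failed wget call
        echo "download failed (wget exit status $status): $url" >&2
        return "$status"
    fi
}
```

For example, download_with_retries http://www.softpedia.com/avira_antivirus_personal.exe 50 would run the same download as above and print a message depending on wget's exit status.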

We can leave Wget to work in the background and write its progress to a log file. The short option -t can be used in place of --tries.

wget -t 50 -o log http://www.softpedia.com/avira_antivirus_personal.exe &

This tells wget to keep retrying to get the file, up to 50 times. The ampersand at the end tells the shell to run wget in the background, while -o log tells wget to write its messages, including any errors, to a file named log.
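
As an alternative to the ampersand, wget has its own -b option that sends it to the background right after startup. The sketch below assumes that is acceptable for your use; the function name background_download and the default log file name are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: let wget put itself in the background with -b.
# Usage: background_download URL [LOGFILE]
background_download() {
    url=$1
    logfile=${2:-log}   # default log file name, as in the example above
    # -b: go to background after startup; -o: write messages to LOGFILE
    wget -b -t 50 -o "$logfile" "$url"
}

# Typical use (not run here):
#   background_download http://www.softpedia.com/avira_antivirus_personal.exe log
#   tail -f log    # follow the progress; Ctrl-C stops tail, not wget
```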

FTP downloads work the same way: type wget, then the number of retries if you don't want to use the default, then the file address.

wget ftp://gnjilux.srk.fer.hr/welcome.msg

If you specify a directory, Wget will retrieve the directory listing, parse it, and convert it to HTML. You can then open the resulting index.html in a browser such as links.
E.g.
wget ftp://ftp.gnu.org/pub/gnu/
links index.html

You can also put all the URLs you want to download in a file and save it. You then use the -i switch to specify the file, e.g.

wget -i file
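
One way to prepare such a file from the shell is sketched below. The helper names make_url_list and fetch_url_list are hypothetical, and the list is written with one URL per line.

```shell
#!/bin/sh
# Hypothetical helper: write the given URLs to a list file, one per line.
# Usage: make_url_list LISTFILE URL...
make_url_list() {
    list=$1
    shift                          # the remaining arguments are the URLs
    printf '%s\n' "$@" > "$list"
}

# Hypothetical helper: download every URL named in the list file.
# Usage: fetch_url_list LISTFILE
fetch_url_list() {
    wget -i "$1"
}

# Typical use (not run here):
#   make_url_list file http://www.softpedia.com ftp://gnjilux.srk.fer.hr/welcome.msg
#   fetch_url_list file
```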

To mirror a website, simply type:

wget -m http://your_website_url
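
It can help to see what -m stands for. According to the wget manual, it is currently equivalent to -r -N -l inf --no-remove-listing, that is, recursion, time-stamping, unlimited depth, and keeping FTP directory listings. The sketch below spells that out; the function name mirror_site is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: mirror a site using the long-hand form of -m.
# Usage: mirror_site URL
mirror_site() {
    # -r: recursive retrieval
    # -N: time-stamping (skip files that are not newer than the local copy)
    # -l inf: no limit on recursion depth
    # --no-remove-listing: keep the .listing files from FTP retrievals
    wget -r -N -l inf --no-remove-listing "$1"
}

# Typical use (not run here):
#   mirror_site http://your_website_url
```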

To create a three-level-deep mirror image of the GNU web site, with the same directory structure the original has, with only one try per document, saving the log of the activities to gnulog, type:

wget -r -l3 -t1 http://www.gnu.org/ -o gnulog

To download a three-level-deep copy of the GNU web site as above, and convert the links in the downloaded files to point to local files so you can view the documents offline:

wget --convert-links -r -l3 http://www.gnu.org/ -o gnulog
or, using the short option -k for --convert-links:

wget -rk -l3 http://www.gnu.org/ -o gnulog

To retrieve only one HTML page, but make sure that all the elements needed for the page to be displayed, such as inline images and external style sheets, are also downloaded, use -p. Adding --convert-links makes the downloaded page reference the downloaded files.

wget -p --convert-links http://www.server.com/dir/page.html

To mirror a complete website recursively and relink the downloaded files for local viewing, for example www.blogger.com, type:

wget -mk http://www.blogger.com/
More help is available on the website of the GNU project, which develops wget:
http://www.gnu.org
