WGET

Mirroring a web site on Linux

You almost certainly have wget already. Try wget --help at the command line. If you get an error message, install wget with your Linux distribution's package manager, or fetch the source from the official wget page and compile your own copy.

Once you have wget installed correctly, the command line to mirror a web site is:

 wget -m -k -K -E http://url/of/web/site

See man wget or wget --help | more for a detailed explanation of each option.
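
As a quick reference, the same mirror command can be written with long option names, which are easier to read back later. This is only a sketch: the function name mirror_site is made up for illustration, and --adjust-extension assumes wget 1.12 or later (older releases call the same option --html-extension).

```shell
# mirror_site is a hypothetical helper name, not part of wget.
mirror_site() {
    # --mirror            (-m) recursive download with time-stamping and unlimited depth
    # --convert-links     (-k) rewrite links in saved pages so they work locally
    # --backup-converted  (-K) keep a .orig copy of each file before its links are rewritten
    # --adjust-extension  (-E) add .html to saved HTML pages whose names lack it
    wget --mirror --convert-links --backup-converted --adjust-extension "$1"
}

# Usage (placeholder URL, as in the command above):
# mirror_site http://url/of/web/site
```

Wrapping the command in a function keeps the option list in one place if you mirror several sites.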

If this command seems to run forever, there may be parts of the site that generate an infinite series of different URLs. You can combat this in many ways, the simplest being to use the -l option to specify how many links “away” from the home page wget should travel. For instance, -l 3 will refuse to download pages more than three clicks away from the home page. You'll have to experiment with different values for -l. Consult man wget for additional workarounds.
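
A depth-limited mirror might look like the sketch below. The helper name mirror_to_depth and its default depth of 3 are assumptions chosen for illustration. The -l option is placed after -m on the assumption that wget applies options in order, so the explicit limit is not overridden by the unlimited depth that -m implies; check this against your wget version.

```shell
# mirror_to_depth is a hypothetical helper; the default depth of 3 is only a starting point.
mirror_to_depth() {
    url=$1
    depth=${2:-3}
    # -l follows -m so that the explicit limit is the last depth setting wget sees.
    wget -m -l "$depth" -k -K -E "$url"
}

# Usage: stop two links away from the home page.
# mirror_to_depth http://url/of/web/site 2
```

Start with a small depth and raise it until the mirror contains the pages you need.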

Note: some web servers may be set up to “punish” users who download too much, too fast. If you're not careful, using tools like wget could get your IP address banned from the site. You can avoid this problem by using the -w option to specify a delay, in seconds, between page downloads. Usually, this will prevent the web server from viewing your behavior as unacceptable. But your mileage may vary!
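
For example, the mirror command can be given a fixed pause between downloads. In this sketch the helper name polite_mirror and the default delay of 2 seconds are arbitrary assumptions; the right delay depends on the site.

```shell
# polite_mirror is a hypothetical helper; the 2-second default is an arbitrary choice.
polite_mirror() {
    url=$1
    delay=${2:-2}
    # -w makes wget wait the given number of seconds between retrievals.
    wget -m -k -K -E -w "$delay" "$url"
}

# Usage: wait 5 seconds between page downloads.
# polite_mirror http://url/of/web/site 5
```

A longer delay makes the mirror slower but less likely to look like abuse to the server.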

Advanced options

 # Download only the URLs listed, one per line, in
 # the local file "my_movies.txt".
 # Wait between files for a random time; current wget versions vary it
 # between 0.5 and 1.5 times the 33-second base wait.
 # When a download fails, try each file up to 22 times, waiting one second
 # longer after each failure, up to a maximum of 48 seconds.
 # Send no User-Agent header at all. Ignore robot exclusions.
 # Place all the captured files in the "/movies" directory
 # and write the log to the local file "my_movies.log".
 # Good for just downloading specific known images or other files.
 wget -t 22 --waitretry=48 --wait=33 --random-wait --user-agent="" \
      -e robots=off -o ./my_movies.log -P/movies -i ./my_movies.txt
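
The file passed to -i is plain text with one URL per line. A minimal sketch of creating such a list follows; the example.com URLs are placeholders, and the LIST_FILE variable is introduced here only for illustration.

```shell
# LIST_FILE is a hypothetical variable; it defaults to the file name used above.
LIST_FILE="${LIST_FILE:-./my_movies.txt}"

# Write the list, one URL per line. These URLs are placeholders, not real files.
# Note that this overwrites any existing file of the same name.
cat > "$LIST_FILE" <<'EOF'
http://example.com/clips/first.mpg
http://example.com/clips/second.mpg
EOF
```

Any text editor works just as well; the only requirement is one URL per line.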