Using Wget with a Proxy: A Beginner’s Guide (+ Code Snippets)

Written by: Robin Geuens

In this post, you'll learn how to use Wget with a proxy.


What is wget?

Wget is a free command-line tool similar to cURL. Wget is primarily used for retrieving data from the web. It's compatible with HTTP, HTTPS, and FTP protocols, and can even retrieve files through HTTP proxies.

Wget is like a Swiss army knife in a developer's toolkit, providing a plethora of functionalities. Here are a few ways you can leverage Wget:

  • Downloading files: Wget can download files from any website or server directly to your local machine. This is particularly useful when dealing with large files or datasets.

  • Mirroring websites: Wget can recursively download entire websites, making it ideal for creating offline versions of sites or backing up content.

  • Crawling web pages: With its ability to follow links in web pages, wget can also be used for web scraping and data extraction.

Wget doesn't come preinstalled on every system. Windows and Mac users, and even users of some Linux distributions, will need to install it themselves. In this guide we won't go over how to install Wget.

But don't worry, there are plenty of resources available online to walk you through installing it with a package manager or manually. For instance, check out this guide on How to Install and Use Wget on Mac and Windows or this Linux guide.
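
For example, on most systems wget is only a package-manager command away (the commands below assume Homebrew, apt, and Chocolatey respectively; adjust for your platform):


# macOS (Homebrew)
brew install wget
# Debian/Ubuntu
sudo apt-get install wget
# Windows (Chocolatey)
choco install wget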

How to Use Wget With a Proxy?

Proxies act as intermediaries between your computer and the internet, and they can offer several advantages to Wget users.

  • Proxies can bypass geographical restrictions on content.

  • They can help avoid rate limits imposed by servers by distributing requests among multiple IP addresses.

  • Proxies provide an extra layer of anonymity by masking your IP address.

Wget can be configured to use a proxy in various ways. This involves setting up your proxy server details and then directing wget to route its requests through that proxy.
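
If you only need a proxy for a single command, you can also pass the settings inline with wget's -e flag, which lets you supply any .wgetrc directive on the command line (the proxy address below is a placeholder):


wget -e use_proxy=on -e http_proxy=http://your-proxy-server-ip:port/ http://example.com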

Configuring Proxy Settings Using Environment Variables

There are a couple of different ways to set up proxies in wget. Let's delve into the first method: exporting environment variables.

Exporting Variables

Setting up a proxy this way is as straightforward as defining two environment variables. You can specify your proxy settings with these commands:


export http_proxy=http://your-proxy-server-ip:port/   
export https_proxy=https://your-proxy-server-ip:port/   

This tells wget to use the designated IP address and port for your HTTP and HTTPS proxies, respectively. Note: if you're on Windows, you'll have to use set instead of export.
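
In a Windows Command Prompt, for example, the same settings would look like this:


set http_proxy=http://your-proxy-server-ip:port/
set https_proxy=https://your-proxy-server-ip:port/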

But what if you're seeking a solution that keeps working after you've closed your terminal session? That's where the .wgetrc file enters the scene.

Using a .wgetrc File

Think of the .wgetrc file as a personal assistant to wget. It's wget's configuration file, holding settings that wget reads every time it leaps into action.

Creating a .wgetrc file in Windows is a breeze. Simply head to your home directory (typically C:\Users\Your_Username) and create a file named .wgetrc (some Windows builds expect you to point the WGETRC environment variable at this file). On macOS, the procedure is the same, but your home directory will be /Users/Your_Username.

Defining proxy variables in a .wgetrc file mirrors the process of exporting them:


http_proxy = http://your-proxy-server-ip:port/   
https_proxy = https://your-proxy-server-ip:port/   

Note: You can instruct wget to disregard the proxy for specific domains. For instance, let's say you want to bypass the proxy for example.com. You can set the no_proxy variable, either in your .wgetrc or as an environment variable, like so:

no_proxy = example.com

There's also a --no-proxy command-line flag, but be aware that it turns proxies off entirely for that command rather than for specific domains.

Proxy Authentication with wget

Premium proxy providers often require a username and password for access. To use these proxies, it's essential to send your credentials along with your request. Thankfully, wget simplifies this process with the --proxy-user and --proxy-password options.

For instance:


wget --proxy-user=username --proxy-password=password http://example.com

Alternatively, you can combine your username, password, IP, and port all at once in the environment variables we mentioned earlier:


export http_proxy=http://username:password@proxy-server-ip:port/   

Or you can include your username and password in the .wgetrc file for convenience.
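
In that case, the .wgetrc entries would look something like this (with your own credentials and proxy address filled in):


http_proxy = http://username:password@your-proxy-server-ip:port/
https_proxy = https://username:password@your-proxy-server-ip:port/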

Basic wget Commands

Downloading a Single File

The basic syntax for downloading a file using wget is:


wget [options] [URL]

Here, [options] is where you add any option flags, and [URL] is the web address of the file you wish to download.

For instance, if you want to download a file from http://example.com/sample.pdf, you would use:


wget http://example.com/sample.pdf   

This command will download the sample.pdf file into your current directory.

What if your download gets interrupted? No problem! wget has a nifty -c option that allows you to resume your download. Just use the same command you started the download with, but add the -c option:


wget -c http://example.com/sample.pdf   

This command will resume the download of sample.pdf from where it was interrupted.

Wget is a versatile tool that's not just limited to downloading single files. In fact, it can be used to download multiple files at once, save files to specific directories, and even rename downloaded files. Let's explore these features in more detail.

Downloading Multiple Files

The syntax for downloading multiple files is quite similar to that for a single file, with the addition of the -i option. This option is followed by a text file that contains the URLs of the files you want to download.


wget -i filelist.txt  

In this example, filelist.txt is a text file containing a list of URLs, each on its own line. Here's a sample content of filelist.txt:


http://example.com/file1.pdf   
http://example.com/file2.pdf   
http://example.com/file3.pdf   

Now, you can run wget -i filelist.txt to download all three files, one after the other.

Saving a File to a Specific Directory

To choose the directory a download is saved to, you can use the -P or --directory-prefix option. For instance, if you want to download a file to the /usr/local directory, you would use the following command:


wget -P /usr/local http://example.com/samplefile.zip

Renaming a Downloaded File

Renaming a downloaded file using wget is simple. You can use the -O option to specify a new name for your downloaded file. Here's how you would download an image from a website and rename it:


wget -O newimage.jpg http://example.com/image.jpg  

To avoid overwriting an existing file with the same name, you can use the -nc or --no-clobber option:


wget -nc http://example.com/image.jpg  

Changing the User-Agent with wget

The User-Agent is an identifying string that your client (a browser, or in this case wget) sends to the server, declaring its type and version.

This might seem like a small detail, but it's actually quite significant because it can influence the response or behavior of the web service.

Some websites may even limit access based on the User-Agent.

Changing the User-Agent in wget is pretty straightforward. One way is to adjust the .wgetrc file: add or alter the line user_agent = "string", swapping out "string" for your preferred User-Agent. Here's an example:


user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"  

Alternatively, you could use the -U or --user-agent option to set the User-Agent string directly in the command line. Here's a quick example:


wget --user-agent="Mozilla/5.0" http://example.com  
 

Limiting Download Speed

It's a good idea to have some control over your download speed. This practice ensures network stability by preventing any one process, like downloading a large file, from hogging bandwidth and slowing down other processes.

This is particularly critical for developers who may be sharing a network with others or running several tasks at once. Thankfully, wget lets you limit download speed.

You can manage this by using the --limit-rate option. This allows you to specify the maximum transfer rate for data retrieval.

It's measured in bytes per second unless a K (for kilobytes per second) or M (for megabytes per second) is appended.

For instance, if you want to restrict the download speed to 10 KB/s while also going through an authenticated proxy, you could use the following command:


wget --limit-rate=10k --proxy-user=username --proxy-password=password http://example.com

In this command, --limit-rate=10k caps the transfer rate, --proxy-user and --proxy-password supply the proxy credentials, and the URL at the end (http://example.com) is the file or webpage you're aiming to download.

In some scenarios, you might also want to control the frequency of download requests. wget caters to this need with the --wait and --waitretry options. The --wait option makes wget pause between every retrieval, while --waitretry sets the maximum number of seconds wget waits between retries of a failed download.

To pause 1 second between requests, for instance, you could use the following command:


wget --wait=1 http://example.com  
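
And to make wget wait up to 5 seconds between retries of a failed download:


wget --waitretry=5 http://example.com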

Extracting Links from a Webpage

Wget can also extract links from a webpage or a list of URLs and act on them. This comes in handy when you need to download multiple files or check the status of various links from a webpage.

Using wget to Download Files from a List of URLs
The --input-file option in wget allows you to download files from a list of URLs. The usage is straightforward: you create a text file with the URLs you want to download, and then pass this file to wget as the argument to the --input-file option.

Suppose you have a text file named urls.txt containing the following URLs:


http://example.com/file1   
http://example.com/file2   
http://example.com/file3   
 

You can then tell wget to download these files with the following command:


wget --input-file=urls.txt  

This command will download file1, file2, and file3 from example.com.

Parsing an HTML file with wget
If your input file is an HTML file and you want wget to treat it as such, use the --force-html option. When used, wget will parse the HTML file and follow the links found within.

For instance, if you have an HTML file named links.html containing links, you can extract them using this command:


wget --force-html --input-file=links.html

This command will make wget parse links.html, extract the links, and download the linked files.

Checking the Availability of Remote URLs
Finally, the --spider option can be used to check the availability of remote URLs without downloading them. This is useful when you want to verify links without consuming too much bandwidth.

To check the status of the links in urls.txt, you could use the following command:


wget --spider --input-file=urls.txt 

This command will check each URL in urls.txt and print out its status without downloading anything.
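
You can also check a single URL the same way:


wget --spider http://example.com/sample.pdf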

Converting Links on a Page

Another significant advantage of using wget is its ability to convert links on a page. This feature is particularly useful when you download a webpage for offline use.

By converting links, you ensure that all internal navigation points to your local files instead of the original online sources.

Using wget to Convert Links
To use the link conversion feature, add the --convert-links option to your wget command. This will make wget adjust the links in downloaded HTML or CSS files to point to local files.

Here's an example of how it works:


wget --convert-links https://www.example.com   

This command downloads the webpage at www.example.com and converts all links to point to local files.

Adjusting File Extensions with wget
If you want downloaded files to have suitable extensions, use the --adjust-extension option. It tells wget to append .html (or .css) to downloaded pages and stylesheets whose URLs don't already end in the proper extension.

You can download a webpage and adjust its extension as follows:


wget --convert-links --adjust-extension https://www.example.com   

Downloading Page Requisites with wget
The --page-requisites option in wget ensures you download all the files necessary to properly display a given HTML page, including images and stylesheets.

Here's how you can use it:


wget --convert-links --page-requisites https://www.example.com   

Web Page Mirroring with Wget

Mirroring is a potent feature of wget that lets you download a webpage, or an entire site, along with all of its resources.

This creates an offline mirror of the page, a feature that comes in handy for offline browsing, site backup, or even in-depth SEO analysis.

To mirror a webpage using wget, leverage the --mirror option. This option activates settings optimal for mirroring, like infinite recursion and time-stamping. Here's a simple example:


wget --mirror https://www.example.com   

Preventing Directory Ascension
To prevent wget from ascending to the parent directory while mirroring a webpage, use the --no-parent option. This option is especially useful when you want to limit your download to a specific section of a site. Here's how you do it:



wget --mirror --no-parent https://www.example.com/specific-section/   

Preserving File Timestamps
The --timestamping (-N) option preserves the original modification time of downloaded files and, on repeat runs, only re-downloads files that have changed on the server. It's one of the settings --mirror already enables, and it comes in handy when you want to keep the same file metadata as the original web page. Here's the command for this:


wget --mirror --timestamping https://www.example.com   
 

Conclusion

Equipped with these wget techniques and options, you're now ready to download, convert, and mirror web content effectively.

Don't forget to practice and experiment with these options to become more adept at using wget.

Robin Geuens

Robin is the SEO specialist at SOAX. He likes learning new skills and automating things with Python and GPT. Outside of work he likes reading, playing videogames, and traveling.
