How to use wget in Python with proxies: Step-by-step guide

Written by: John Fáwọlé

When downloading files or scraping data from the web, you might find it helpful to route your traffic through a proxy. Using a proxy with wget offers lots of advantages, like bypassing geographic restrictions, avoiding rate limits imposed by websites, and enhancing your anonymity. 

In this guide, we'll explore the benefits of using a proxy with wget and show you how to configure your proxy settings efficiently so you can make the most of your downloads.

Why use wget for web scraping?

wget is a command-line utility used for retrieving content from web servers. Its main purpose is to automate the downloading of files from the internet. The name wget comes from "web get."

Unlike more complex web scraping tools that parse HTML or interact dynamically with web pages, wget excels in scenarios where the goal is to download files or entire web pages quickly. This makes it ideal for gathering static content like images, documents, or HTML files for offline analysis.

Modern websites often require authentication or custom headers for accessing resources. wget supports these requirements with options to include cookies, HTTP headers, and credentials. This enables it to scrape protected content when proper permissions are in place.

Python developers can integrate wget with Python scripts to automatically download files. While Python has libraries like requests and BeautifulSoup for web scraping, using wget in a Python script can simplify tasks like bulk file downloads and mirroring websites. The subprocess module in Python allows easy execution of wget commands, combining the best of both tools.

What makes wget helpful in web scraping?

  • Support for various protocols
  • Resuming interrupted downloads
  • Bandwidth control
  • Wildcard support
  • Recursive downloads
  • Robot exclusion compliance

Support for various protocols (HTTP, HTTPS, FTP)

wget is a versatile tool that supports multiple internet protocols like HTTP, HTTPS, and FTP. This allows it to retrieve files from a variety of sources, whether they are standard web servers, secure websites, or file transfer servers. HTTPS support ensures secure data transfers by handling SSL/TLS encryption, making it reliable for accessing modern, secure web resources.

wget automatically adapts to protocol-specific challenges. For instance, it can follow HTTP redirects to locate relocated resources or traverse directory structures on FTP servers. This eliminates the need for multiple tools, enabling you to handle different types of downloads efficiently.

Whether fetching datasets from HTTPS APIs or pulling archives from FTP repositories, wget integrates seamlessly into automated workflows. Its compatibility with both legacy and modern protocols ensures it remains a useful tool for reliably handling diverse web content.
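
As a quick illustration, here's a minimal sketch (the URLs below are placeholders) showing that the same call works regardless of protocol; wget picks the right handler based on the URL scheme:

import subprocess

# Placeholder URLs: one HTTPS resource and one FTP resource
urls = [
    "https://example.com/report.pdf",
    "ftp://ftp.example.com/pub/archive.tar.gz",
]

for url in urls:
    # wget chooses HTTP(S) or FTP handling based on the URL scheme
    subprocess.run(["wget", url])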

Resuming interrupted downloads

wget can resume interrupted downloads, making it useful for unreliable networks or large file transfers. By using the -c flag, you can continue a download from where it left off, preventing wasted bandwidth and time. This is particularly helpful when you’re downloading massive datasets or software packages.

The tool achieves this by checking the size of the partial file and requesting only the remaining bytes from the server. Because the partial file on disk is all wget needs, resuming works even after a system restart.

This capability also allows you to manage bandwidth efficiently. For example, you can pause and resume large downloads in segments to comply with network limitations or quotas. This flexibility makes wget a reliable choice, especially if you have dynamic download needs.

Bandwidth control

wget includes options for controlling bandwidth usage, allowing you to limit download speeds with the --limit-rate flag. This feature is especially useful in shared network environments, where downloading large files could otherwise disrupt other people on the network. By setting a maximum download rate, you can ensure fair resource distribution.

This functionality is also helpful for avoiding detection or throttling by web servers. When scraping large amounts of data, high-speed downloads may trigger server defenses or IP bans. Limiting your download rate keeps your traffic closer to ordinary browsing patterns, making your downloads both less disruptive and less conspicuous.

If you’re a developer or a researcher, you’ll find the bandwidth control feature useful for optimizing network usage while running multiple concurrent tasks. By allocating bandwidth effectively, wget ensures that you don’t interrupt any critical processes, which helps you to balance efficiency and network management.

Wildcard support

With its wildcard support, wget allows you to download multiple files matching a specific pattern. For example, you can use wildcards in file names or URLs to fetch all files of a particular type, such as .jpg or .csv, from a directory. This simplifies the process of bulk downloading without needing to specify each file individually.

Wildcard support is particularly useful on FTP servers, where files in directories may not have predictable naming conventions. Instead of manually fetching files, you can automate the process by specifying patterns, significantly reducing time and effort.

This capability is useful for organizing large datasets or downloading related files in bulk. By combining wildcard support with other features like recursive downloading, wget becomes a powerful tool for managing large-scale download operations efficiently.
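
As a rough sketch (the URLs are placeholders): FTP URLs can contain shell-style globs that wget expands itself, while for HTTP(S) you would typically combine recursion with an accept pattern instead:

import subprocess

# FTP: wget expands the glob itself (passing it as a list element keeps the shell out of it)
subprocess.run(["wget", "ftp://ftp.example.com/pub/*.csv"])

# HTTP(S): use recursion plus an accept pattern to fetch only matching files
subprocess.run(["wget", "-r", "-A", "*.jpg", "https://example.com/gallery/"])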

Recursive downloads

One of wget's most powerful features is recursive downloading, which enables you to download entire websites or directories. With the -r flag, wget follows links on a webpage to fetch all linked resources, such as HTML files, images, and stylesheets, creating a local copy of the website structure.

This helps you to create offline backups or analyze website content without an internet connection. You can also customize recursive downloads with depth limits, allowing you to specify how many levels of links to follow. This ensures your downloads remain focused while avoiding unnecessary data retrieval.

By combining recursive downloads with features like rate limiting and file type filtering, wget becomes an efficient solution for tasks like mirroring websites or collecting specific web content for research or development purposes.

Robot exclusion compliance

wget respects the robots.txt file (which outlines the parts of a website that are off-limits to automated tools) by default. This adherence to robot exclusion protocols makes wget a responsible tool for downloading web content while respecting website owners' preferences.

You can override this behavior if necessary, but the default compliance ensures ethical scraping practices. This reduces the risk of violating terms of service or encountering legal issues when interacting with web resources.
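
The override is done with the robots wgetrc setting, which can be passed on the command line with -e. A minimal sketch (placeholder URL):

import subprocess

# By default wget honors robots.txt during recursive downloads;
# -e robots=off overrides that. Use it only where you have permission to crawl.
subprocess.run(["wget", "-r", "-e", "robots=off", "https://example.com/"])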

wget's robot exclusion compliance provides peace of mind for developers and researchers. It balances powerful downloading capabilities with responsible usage, which makes it a trustworthy tool for web interactions.

Why use wget with Python?

Python’s flexibility as a programming language complements wget’s simplicity and efficiency in downloading resources. 

By integrating wget commands into Python scripts, you can automate, customize, and scale your workflows for a wide range of use cases, from simple file retrieval to complex web scraping projects.

While Python has libraries like requests and BeautifulSoup for web scraping and HTTP requests, wget excels in tasks that involve downloading large datasets, mirroring websites, or managing bulk file retrieval.

Using Python as a control layer for wget allows you to combine the strengths of both tools, offering enhanced automation and adaptability in handling diverse web resources.

Automate downloads

Python’s scripting capabilities make it an ideal partner for automating wget downloads. By embedding wget commands in Python scripts using modules like subprocess, you can schedule and repeat download tasks without manual intervention. 

For example, Python can dynamically generate download URLs or read them from a database, passing them to wget for execution.

import subprocess

urls = ["http://example.com/file1", "http://example.com/file2"]
for url in urls:
    subprocess.run(["wget", url])

This automation is particularly useful for repetitive tasks, such as downloading daily reports, fetching periodic data updates, or mirroring websites regularly. With Python managing the logic and wget executing the downloads, you can develop a reliable, efficient workflow that saves time and effort.
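
For instance, here's a small sketch of that idea (the URL pattern is hypothetical) that builds dated report URLs in Python and hands each one to wget:

import subprocess
from datetime import date, timedelta

# Hypothetical URL pattern for daily reports
base_url = "https://example.com/reports/{}.csv"

# Fetch the last seven daily reports
for offset in range(7):
    day = date.today() - timedelta(days=offset)
    subprocess.run(["wget", base_url.format(day.isoformat())])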

Customize wget options through Python scripts

Python enhances the usability of wget by allowing developers to customize and dynamically configure its options. Parameters like download rate limiting, recursive depth, or user-agent strings can be adjusted in Python code and passed to wget as arguments. 

For example, you can program Python scripts to add authentication tokens, cookies, or headers required for accessing protected resources, leveraging wget’s built-in support for these features.

import subprocess

url = "http://example.com/protected-file"
token = "my-auth-token"
subprocess.run(["wget", "--header", f"Authorization: Bearer {token}", url])

This level of customization is useful for handling diverse tasks like downloading content from multiple domains or filtering specific file types.
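
As a rough sketch (placeholder URL), you can build the option list in Python and pass it straight to wget, for example to restrict a shallow recursive crawl to PDF files while throttling bandwidth:

import subprocess

url = "https://example.com/docs/"
options = [
    "-r",                 # recursive download
    "-l", "2",            # limit recursion depth to two levels
    "-A", "pdf",          # accept only files ending in .pdf
    "--limit-rate=200k",  # throttle the download rate
]
subprocess.run(["wget", *options, url])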

Ease of use and integration into existing Python workflows

Python’s simplicity and wide adoption make integrating wget commands into existing workflows straightforward. By using Python modules like os or subprocess, you can invoke wget commands directly within your scripts, so it blends easily with other parts of your codebase.

import os

os.system("wget http://example.com/data.csv")

This integration allows you to chain wget tasks with other Python processes, such as parsing downloaded files, updating databases, or triggering subsequent operations.
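
For example, here's a minimal sketch (placeholder URL) that downloads a CSV file with wget and then continues the workflow in Python using the standard csv module:

import csv
import subprocess

url = "https://example.com/data.csv"
subprocess.run(["wget", url, "-O", "data.csv"])

# Continue the workflow in Python: count the rows in the downloaded file
with open("data.csv", newline="") as f:
    row_count = sum(1 for _ in csv.reader(f))
print(f"Downloaded {row_count} rows")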

How to install wget

Linux

Most Linux distributions include wget by default. If it’s not installed, use the package manager for your distribution.

Debian/Ubuntu:

sudo apt update && sudo apt install wget

Fedora/RHEL:

sudo dnf install wget

Arch Linux:

sudo pacman -S wget

To verify installation:

wget --version

macOS

Install wget using Homebrew, a popular package manager for macOS.

Ensure Homebrew is installed. If not, install it with:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Then install wget:

brew install wget

Verify the installation:

wget --version

Windows

wget is not included by default on Windows, but you can install it either via Chocolatey or by manually downloading it from the GNU website.

1. Install via Chocolatey:

Set-ExecutionPolicy Bypass -Scope Process -Force;
[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072;
iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))

Then:

choco install wget

2. Manually download:

Visit the GNU wget downloads page and download the latest version for Windows.

Extract the files and add the wget.exe file to your system’s PATH for easy command-line use.

Verify installation:

wget --version

How to install Python

Linux

Debian/Ubuntu:

sudo apt update && sudo apt install python3 python3-pip

Verify Python installation:

python3 --version
pip3 --version

macOS

Install Python using Homebrew:

brew install python

Verify the installation:

python3 --version
pip3 --version

Windows

Download the latest Python version from the official Python website and run the installer, making sure to select the option to add Python to your PATH.

Verify installation:

python --version
pip --version

Running wget with Python: Step-by-step guide

The wget utility can be run from Python by leveraging the subprocess module, which allows Python to execute shell commands. This integration enables the automation of wget tasks, such as downloading files or mirroring websites, directly from Python scripts, making it ideal for combining Python's logic with wget's efficiency.

Using the subprocess module

The subprocess module provides powerful tools to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. To run wget commands, the subprocess.Popen class is a versatile choice. It allows you to execute commands, capture their output, and handle any errors that may occur.

Here’s an example of how to create a reusable function to execute shell commands, including wget, using subprocess.Popen.

Implementing the execute_command function

import subprocess

def execute_command(command):
    """Executes a shell command and captures its output and errors."""
    try:
        # Run the command
        process = subprocess.Popen(
            command,  # The command to execute
            shell=True,  # Use the shell to interpret the command
            stdout=subprocess.PIPE,  # Capture standard output
            stderr=subprocess.PIPE  # Capture standard error
        )

        # Communicate with the process
        stdout, stderr = process.communicate()

        # Decode outputs
        stdout = stdout.decode("utf-8")
        stderr = stderr.decode("utf-8")

        if process.returncode != 0:  # Check for errors
            print(f"Error executing command: {stderr}")
        else:
            print(f"Command output: {stdout}")

        return stdout, stderr, process.returncode

    except Exception as e:
        print(f"An exception occurred: {e}")
        return None, None, -1

Key arguments in subprocess.Popen

  • command: A string containing the shell command to be executed, such as a wget command.
  • shell=True: When set to True, this allows the command to be interpreted by the shell. This is especially useful for commands that include pipes, redirections, or complex syntax.
  • stdout=subprocess.PIPE: Redirects the command’s standard output (e.g., successful messages) to a Python object, enabling the script to capture and process it.
  • stderr=subprocess.PIPE: Redirects the command’s error output (e.g., warnings or errors) similarly, allowing the script to handle them programmatically.

Example: Downloading a file using wget

# Example: Downloading a file using wget
url = "https://example.com/file.zip"
command = f"wget {url}"
stdout, stderr, returncode = execute_command(command)

if returncode == 0:
    print("Download successful!")
else:
    print("Download failed!")

This example combines Python’s flexibility with wget’s capabilities, letting you execute commands efficiently while keeping error handling and logging in place.

wget top use cases

Downloading a single file

Using wget in Python, you can easily automate the download of a single file. For example:

import subprocess

url = "https://example.com/file.zip"
command = f"wget {url} -O downloaded_file.zip"
subprocess.run(command, shell=True)

This command downloads a file from the specified URL and saves it with a custom name (downloaded_file.zip) using the -O option. The output of wget typically includes details like download progress, transfer speed, and time remaining, making it easy to monitor the process. Additionally, the --directory-prefix (-P) option allows specifying a folder for saving files, e.g., -P ./downloads.
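
For instance, here's a small variation (same placeholder URL) that saves the file into a local downloads folder using -P:

import subprocess

url = "https://example.com/file.zip"
command = f"wget {url} -P ./downloads"
subprocess.run(command, shell=True)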

Downloading a webpage

wget provides a straightforward way to download a webpage for offline use. Here's an example:

url = "https://example.com"
command = f"wget {url} -O webpage.html"
subprocess.run(command, shell=True)

For more complex use cases, such as mirroring a website, you can use wget's recursive download feature.

Recursive downloading fetches the specified page along with linked resources like images, stylesheets, and scripts. The --recursive (-r) option enables this behavior and can replicate the structure of a website for offline browsing.

Downloading with timestamps

The --timestamping (-N) option in wget makes sure you only download files if they are newer than your local copies. This is especially useful for maintaining up-to-date backups without redundant downloads.

url = "https://example.com/daily-report.csv"
command = f"wget {url} -N"
subprocess.run(command, shell=True)

With this option, wget compares the server's file modification time with the local file and downloads the file only if it has been updated, optimizing bandwidth and storage.

Resuming interrupted downloads

Network interruptions can halt downloads midway, but the --continue (-c) option allows you to easily resume incomplete downloads. For instance:

url = "https://example.com/large-file.zip"
command = f"wget {url} -c"
subprocess.run(command, shell=True)

This option appends data to the partially downloaded file rather than starting over, saving time and bandwidth. It's particularly useful for downloading large files over unreliable connections.

Downloading an entire website

To mirror an entire website for offline use, recursive downloading with wget is the go-to method. The options --recursive (-r), --level (-l), and --convert-links (-k) work together to achieve this:

url = "https://example.com"
command = f"wget -r -l 2 -k {url}"
subprocess.run(command, shell=True)

  • -r: Enables recursive downloading
  • -l 2: Limits the recursion depth to two levels of links
  • -k: Converts links in downloaded HTML files for offline browsing

This combination downloads the website structure up to the depth you've specified while ensuring the links are accessible locally, making it perfect for creating offline copies of web content.

Getting into wget options (advanced)

Changing the user agent

The User-Agent is a string sent by clients (browsers or tools like wget) to web servers to identify themselves. This information helps servers deliver appropriate content or block unwanted traffic. In web scraping, modifying the User-Agent can help bypass blocks or emulate specific browsers.

To change the User-Agent in wget, use the -U or --user-agent option:

command = "wget -U 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' https://example.com" 
subprocess.run(command, shell=True)

Alternatively, you can configure the User-Agent globally in the .wgetrc file:

echo 'user_agent = Mozilla/5.0 (Windows NT 10.0; Win64; x64)' >> ~/.wgetrc

This ensures wget always uses the specified User-Agent without needing to specify it in commands. Customize the User-Agent responsibly and make sure your scraping stays within each website’s policies.

Limiting download speed

The --limit-rate option allows you to control the bandwidth used by wget, which is especially useful to avoid overloading servers or managing shared internet connections. For instance, to cap download speed at 500 KB/s:

command = "wget --limit-rate=500k https://example.com/large-file.zip"
subprocess.run(command, shell=True)

This ensures the download doesn’t exceed the specified speed, helping maintain network performance. You can specify rates in kilobytes (e.g., 500k), megabytes (2m), or bytes. Limiting bandwidth is a best practice for respectful and responsible usage when downloading from external servers.

Working with URL lists and link checking

Using the --input-file option

The --input-file option lets you supply a text file containing a list of URLs to download:

with open("urls.txt", "w") as file:
    file.write("https://example.com/file1\n")
    file.write("https://example.com/file2\n")

command = "wget --input-file=urls.txt"
subprocess.run(command, shell=True)

This is ideal for bulk downloading, automating tasks such as fetching multiple files from a dataset or media library.

Using the --spider option

The --spider option is a "no-download" mode that verifies the availability of links without downloading their content:

command = "wget --spider https://example.com"
subprocess.run(command, shell=True)
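
Because wget exits with a non-zero status when the target can't be retrieved, you can capture the return code in Python to check a link before committing to a full download. A minimal sketch:

import subprocess

result = subprocess.run(["wget", "--spider", "https://example.com"],
                        capture_output=True)
if result.returncode == 0:
    print("Link is reachable")
else:
    print("Link appears to be broken or unreachable")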

Limitations of wget and good alternatives

While wget is a versatile tool for downloading files and mirroring websites, certain scenarios may call for more specialized tools. Depending on your use case, other options like curl, Beautiful Soup, or Selenium might better meet your needs, especially for tasks beyond simple downloads.

curl for advanced HTTP requests

For situations where you need to make advanced HTTP requests, such as sending custom headers, managing cookies, or interacting with APIs, curl is often a better choice. Unlike wget, curl excels at fine-grained control over HTTP methods like POST, PUT, and DELETE. For example:

curl -X POST -H "Authorization: Bearer <token>" -d "param=value" https://example.com/api

Its flexibility and support for various protocols make curl ideal for testing endpoints or interacting with APIs programmatically.

Beautiful Soup for parsing HTML

If your goal is to extract specific data from a webpage, such as text or links, wget alone won’t suffice. In such cases, Beautiful Soup, a Python library for parsing and navigating HTML, provides better functionality. It allows you to focus on scraping structured data:

from bs4 import BeautifulSoup
import requests

response = requests.get("https://example.com")
soup = BeautifulSoup(response.content, "html.parser")

for link in soup.find_all("a"):
    print(link.get("href"))

While wget downloads entire files or websites, Beautiful Soup focuses on understanding and extracting specific information from HTML, making it more suitable for web scraping.

Selenium for browser automation

Dynamic websites that heavily rely on JavaScript for rendering content can pose challenges for wget. In such cases, Selenium shines as a browser automation tool, capable of interacting with web elements like buttons, forms, and dynamic content. For example, it can load JavaScript-heavy pages, take screenshots, or simulate user actions:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")
content = driver.page_source
print(content)

driver.quit()

Selenium is especially useful for tasks like filling forms or scraping content that wget cannot access. For an in-depth guide, check out this Selenium scraping blog post. You can also explore the Selenium glossary for key concepts.

Using wget with a proxy

Using a proxy with wget can offer various benefits, such as bypassing geo-restrictions, avoiding rate limits, and enhancing anonymity during downloads. Proxies can also help when scraping content from websites that limit access based on IP addresses, or when you want to simulate browsing from different geographic locations.

Benefits of using a proxy with wget

Bypassing geo-restrictions

Some websites limit access based on geographical locations, blocking IPs from certain countries. By using a proxy, you can mask your real IP address and appear to access the website from a different location, bypassing these geo-restrictions. This can be especially useful when downloading content that is only available in specific regions.

Avoiding rate limits

Websites may impose rate limits to prevent excessive requests from a single IP address. Using a proxy allows you to distribute requests across multiple IP addresses, reducing the chance of being throttled or blocked. This is particularly valuable when downloading large volumes of data or scraping websites with strict rate limits.

Enhancing anonymity

By routing your requests through a proxy server, you can add a layer of anonymity, as the website will only see the proxy's IP address rather than your own. This can be important for privacy or security reasons when downloading files or scraping websites.

Configuring proxy settings

Using environment variables

One way to configure a proxy for wget is by setting environment variables for http_proxy and https_proxy. These variables inform wget of the proxy server to use for HTTP and HTTPS requests, respectively. To set these variables in your terminal session:

export http_proxy=http://proxyserver:port
export https_proxy=http://proxyserver:port
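
When driving wget from Python, you can set these variables just for the wget process by passing an environment mapping to subprocess.run. A minimal sketch (the proxy address is a placeholder):

import os
import subprocess

proxy = "http://proxyserver:port"  # placeholder proxy address

env = os.environ.copy()
env["http_proxy"] = proxy
env["https_proxy"] = proxy

subprocess.run(["wget", "https://example.com/file.zip"], env=env)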

Using the .wgetrc file

For persistent proxy settings across all wget sessions, you can configure the .wgetrc file. This file is located in your home directory (~/.wgetrc) and stores user-specific wget configurations. Add the following lines to the .wgetrc file to set the proxy:

http_proxy = http://proxyserver:port
https_proxy = http://proxyserver:port

This method ensures that all wget commands use the proxy without needing to set environment variables manually each time.

Proxy authentication

If your proxy requires authentication, you can provide the username and password directly in the wget command or through the .wgetrc file.

Using Command-Line options

You can authenticate with the proxy using the --proxy-user and --proxy-password options:

wget --proxy-user=username --proxy-password=password http://example.com/file.zip

Using environment variables or .wgetrc

Alternatively, you can set the http_proxy and https_proxy environment variables to include the username and password, like so:

export http_proxy=http://username:password@proxyserver:port
export https_proxy=http://username:password@proxyserver:port

Or, in your .wgetrc file, you can specify the proxy with credentials:

http_proxy = http://username:password@proxyserver:port
https_proxy = http://username:password@proxyserver:port

Using these methods, you can authenticate with your proxy without manually entering credentials each time.
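
Here's a rough sketch of the same idea from Python, reading the credentials from environment variables (the names PROXY_USER and PROXY_PASS are hypothetical) so they aren't hard-coded in the script. It assumes the proxy address itself is already configured via http_proxy/https_proxy or .wgetrc:

import os
import subprocess

# Hypothetical environment variables holding the proxy credentials
user = os.environ["PROXY_USER"]
password = os.environ["PROXY_PASS"]

subprocess.run([
    "wget",
    f"--proxy-user={user}",
    f"--proxy-password={password}",
    "http://example.com/file.zip",
])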

Conclusion

Integrating a proxy with wget can really help you manage your downloads and scrape web data. Whether you're bypassing geo-restrictions or avoiding rate limits, a properly configured proxy helps you achieve these goals more effectively. Access a pool of more than 191 million whitelisted IPs with a three-day trial with SOAX for just $1.99.

John Fáwọlé

John Fáwọlé is a technical writer and developer. He currently works as a freelance content marketer and consultant for tech startups.
