Selenium is an open-source framework that allows you to control a browser programmatically. It’s a powerful tool for extracting data from websites, especially if you want to scrape a website that relies heavily on JavaScript or has a complex structure (e.g., websites with dynamic content, interactive elements, or nested HTML).
This step-by-step guide is designed for beginners who want to start scraping with Selenium in Python. This guide assumes you already have some experience using Python and understand basic HTML structure. After following the steps in this guide, you will be able to move on to more complex Selenium scraping projects.
What is Selenium and when should you use it?
Selenium is an open-source tool that automates web browsers. It's useful for a variety of tasks, including web testing, automating repetitive actions, and web scraping. In the context of web scraping, Selenium is particularly useful because it can handle dynamic content and interact with JavaScript elements like a human would.
Selenium’s ability to handle dynamic content and interact with JavaScript elements means you can use it to extract data from websites that constantly update (like social media feeds) or require user interaction, such as logging in or clicking buttons. For example, you could use Selenium to scrape product data from an ecommerce site, collect social media posts, or gather financial data from dynamic charts.
Read more: What is Selenium and how does it work?
Prerequisites for web scraping with Selenium
Before we get started, there are a few things you will need:
- Python (and a basic understanding of it)
- Selenium
- A web browser (e.g., Chrome)
- The specific web driver for your browser (e.g., ChromeDriver)
- Additional packages
Python
For this guide, it will be useful if you have a basic knowledge of Python, such as understanding variables, loops, functions, and data structures, as well as familiarity with Python packages and how to install them using pip.
You can download Python for free for macOS, Windows, or Linux from the Python website.
Selenium
Selenium can be installed using a pip command in your terminal:
pip install selenium
This will install the Selenium package in Python.
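If you want to confirm the installation worked, you can print the installed version from your terminal:
python -c "import selenium; print(selenium.__version__)"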
Web browser
You can use any web browser, but in this guide, we’ll use Chrome, which you can download from its official website.
You should also understand some browser basics, such as:
- HTML structure including tags, attributes, and the Document Object Model (DOM)
- A basic understanding of CSS selectors and XPath for locating elements
Web driver
You need to download a browser-specific web driver. For this project, we’ll be using Chrome, so you’ll need ChromeDriver, which you can download for free from the Chrome for Developers website.
Once you've downloaded ChromeDriver, you need to make it accessible to your Python scripts. There are two main ways to do this:
- Add it to your system's PATH: This allows you to run ChromeDriver from any location. Instructions for adding to your PATH vary depending on your operating system (Windows, macOS, or Linux).
- Place it in the same directory as your script: This is a simpler option, especially if you're just getting started. Download the ChromeDriver executable and place it in the same folder where you'll save your Python script.
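If you go with the second option, here is a minimal sketch of pointing Selenium at a local ChromeDriver executable. It assumes Selenium 4 and that the executable sits next to your script (on Windows the filename would be chromedriver.exe):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Assumes the ChromeDriver executable is in the same folder as this script
service = Service("./chromedriver")
browser = webdriver.Chrome(service=service)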
Alternatively, you can use the webdriver_manager package to automatically handle ChromeDriver installation and updates. First, install webdriver_manager using pip:
pip install webdriver-manager
Then, use this code in your script:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
Additional packages
We recommend installing the following packages:
pip install webdriver-manager # Simplifies webdriver management
pip install requests # Useful for making HTTP requests
pip install beautifulsoup4 # A library for parsing HTML and XML
If you’re unsure whether you have these packages installed already, you can check using the terminal by typing:
pip freeze
This will list all the packages you’ve installed and their version number.
How to inspect a web page
To extract data with Selenium, you first need to identify the HTML elements that contain the data you're interested in. This means you need to inspect the webpage to find relevant CSS selectors or XPath expressions.
Step 1: Open developer tools
- Right-click on the element you want to inspect and select Inspect (in most browsers like Chrome or Edge). This will open the developer tools panel and highlight the HTML code for the selected element.
- Alternatively, press Ctrl+Shift+I (Windows/Linux) or Cmd+Option+I (Mac) to open developer tools.
Step 2: Examine the HTML
Once the element is highlighted:
- Look at its tag name (e.g., <div>, <span>, <a>).
- Check for attributes like id, class, or name. These attributes are often used to locate elements.
Example: For a button with the following HTML:
<button class="btn-submit" id="submit-button">Submit</button>
You can identify it by:
- CSS Selector: .btn-submit (class) or #submit-button (id).
- XPath: //button[@id='submit-button'] or //button[contains(@class, 'btn-submit')].
Step 3: Copy CSS Selector or XPath
- Right-click on the element in the developer tools and select Copy → Copy selector or Copy XPath.
- Paste it into your Selenium code as needed.
Step 4: Use Selenium to locate elements
Once you have the CSS selector or XPath, use the following methods in Selenium to locate elements.
To locate elements by CSS selector (using the same example as before) you can use:
element = browser.find_element(By.CSS_SELECTOR, ".btn-submit")
To locate elements by XPath, you can use:
element = browser.find_element(By.XPATH, "//button[@id='submit-button']")
Example: Finding quotes on a page
For a webpage like Quotes to Scrape, each quote is inside a <span> with the class text. You can locate these elements by inspecting the HTML:
<span class="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
Then use the CSS selector .text or the XPath //span[@class='text'] to locate it in Selenium:
quotes = browser.find_elements(By.CSS_SELECTOR, ".text")
for quote in quotes:
    print(quote.text)
Tips for inspecting webpages
- As you hover over the HTML in developer tools, it will highlight the corresponding element on the page.
- Press Ctrl+F (Windows/Linux) or Cmd+F (Mac) in developer tools to search for tags, classes, or text.
- Paste your CSS selector or XPath into the search bar to verify it selects the correct elements.
How to scrape a website with Selenium
Now that you have Selenium and a webdriver set up, let’s dive into some basic Selenium concepts that you’ll need for web scraping.
Importing libraries
First, you need to import the necessary modules from the Selenium library. To work with the webdriver, add the following import statement to your script:
from selenium import webdriver
Creating a driver instance
Next, create an instance of the webdriver. This object will be your interface for controlling the web browser. Here’s how you create a ChromeDriver instance:
browser = webdriver.Chrome()
If you’re using webdriver_manager, you would instead use:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
Navigating to a web page
To open a specific web page in the automated browser, use the browser.get() method. For this example, we’ll be scraping Quotes to Scrape, a practice website built for scraping, so let’s navigate to its homepage.
browser.get("https://quotes.toscrape.com/")
This code will open the Quotes to Scrape website (https://quotes.toscrape.com/) in your automated browser window, ready for you to start extracting data.
Locating elements
To extract data from Quotes to Scrape, you need to locate the specific elements on the page that contain the information you're interested in. Selenium uses locators to find elements. Here are some common locators:
- By.ID: Locates an element by its unique id attribute.
- By.CLASS_NAME: Finds elements with a specific CSS class name.
- By.XPATH: Uses an XPath expression to locate an element based on its position in the HTML or its attributes.
- By.CSS_SELECTOR: Uses a CSS selector to find elements.
You can use these locators with the following methods:
- browser.find_element(locator, value): Finds a single element that matches the locator.
- browser.find_elements(locator, value): Finds all elements that match the locator. This returns a list of web elements.
On the Quotes to Scrape homepage, if you inspect the HTML, you'll notice that each quote is contained within a div element with the class quote. You can use this information to locate all the quote elements:
from selenium import webdriver
from selenium.webdriver.common.by import By

# Initialize the WebDriver
driver = webdriver.Chrome()

try:
    # Navigate to the quotes page
    driver.get("https://quotes.toscrape.com/")

    # Find all quote elements on the page
    quotes = driver.find_elements(By.CLASS_NAME, "quote")

    # ... rest of the code to extract data from quotes ...
except Exception as e:
    print(f"Error occurred: {e}")
finally:
    # Ensure the driver is closed
    driver.quit()
Explanation
- WebDriver initialization: A Chrome WebDriver instance is created to control the browser. Make sure you have the Chrome WebDriver installed and available in your system's PATH.
- Page navigation: The script navigates to the specified URL using driver.get().
- Finding elements: The script uses find_elements to locate all elements with the class name "quote". Each of these elements contains a quote and its author.
The find_elements() method (note the plural "elements") returns a list of all the elements that match the locator. In this case, it will give you a list of all the div elements with the class quote.
If you wanted to find a single element, you would use find_element() (singular).
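For example, this small sketch grabs only the first quote container on the page (assuming driver is the open browser from the previous snippet):
# find_element (singular) returns the first matching element only
first_quote_element = driver.find_element(By.CLASS_NAME, "quote")
print(first_quote_element.text)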
Extracting data
Now that you have a list of quote elements, you can extract the actual quote text and author information.
You can use element.text to get the text of an element. For example, to extract the text of the first quote, you can use:
first_quote = quotes[0].find_element(By.CSS_SELECTOR, "span.text").text
print(first_quote)
This code will find the span element with the class text within the first quote element and print its text content.
To extract the value of an attribute from an element, use element.get_attribute("attribute name") — an example follows below. The author's name, however, is ordinary text content, so to get the author of the first quote you can again use .text:
first_author = quotes[0].find_element(By.CLASS_NAME, "author").text
print(first_author)
This code will find the element with the class author within the first quote element and print its text content, which is the author’s name.
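And here is a small get_attribute() sketch: it reads the href of the first link inside the first quote block. On Quotes to Scrape that link is the author's "(about)" page, but verify this against the HTML of whatever page you're scraping:
# Assumes the first <a> inside the quote block is the "(about)" link
about_link = quotes[0].find_element(By.TAG_NAME, "a")
print(about_link.get_attribute("href"))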
By combining locators and these extraction methods, you can effectively scrape the quotes and author information from the Quotes to Scrape website.
Handle pagination (if needed)
Many websites (including Quotes to Scrape) split their content across multiple pages to improve loading speed and user experience. This is called pagination. If you want to scrape all the quotes from Quotes to Scrape – not just the ones on the first page – your Selenium scraper needs to be able to automatically move from one page to the next.
Here’s how to handle pagination with Selenium:
- Inspect the page to find the element that acts as the Next button or link. On Quotes to Scrape, it's an <a> tag (containing the text "Next") inside an <li> element with the class next.
- Use Selenium to click the Next button to navigate to the next page.
- Use a loop to repeat the process of locating quotes, extracting data, and clicking the Next button until you reach the end of the pages.
Here’s an example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

# Initialize the WebDriver
driver = webdriver.Chrome()

try:
    # Navigate to the quotes page
    driver.get("https://quotes.toscrape.com/")

    while True:
        # Find all quote elements on the current page
        quotes = driver.find_elements(By.CLASS_NAME, "quote")

        # Iterate over each quote element and extract the text and author
        for quote in quotes:
            text = quote.find_element(By.CLASS_NAME, "text").text
            author = quote.find_element(By.CLASS_NAME, "author").text
            print(f"Quote: {text}\nAuthor: {author}\n")

        try:
            # Find the "Next" link and click it to go to the next page
            next_button = driver.find_element(By.PARTIAL_LINK_TEXT, "Next")
            next_button.click()
        except NoSuchElementException:
            # If there is no "Next" link, break the loop
            break
except Exception as e:
    print(f"Error occurred: {e}")
finally:
    # Ensure the driver is closed
    driver.quit()
Explanation
The script uses a while True loop to continuously scrape quotes from each page. On each page, it finds all elements with the class name "quote" and extracts the text and author.
Pagination handling:
- The script looks for the Next link using find_element(By.PARTIAL_LINK_TEXT, "Next"). A partial match is used because the link text on the site also includes an arrow character.
- If the Next link is found, the script clicks it to navigate to the next page.
- If the Next link is not found (indicating the last page), a NoSuchElementException is raised, and the loop breaks.
Store the data
Once you've extracted the data, you'll want to store it in a structured format (like a database or CSV file) for later use.
CSV file
CSV (Comma Separated Values) is a simple and widely supported format. You can use Python's csv module to write data to a CSV file.
import csv
# ... (your code to extract quotes and authors)
with open('quotes.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Quote', 'Author'])  # Write header row
    for quote, author in zip(all_quotes, all_authors):  # Assuming you have lists of quotes and authors
        writer.writerow([quote, author])
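The snippet above assumes you already collected two parallel lists, all_quotes and all_authors. A minimal sketch of how you might build them while scraping, reusing the driver from the earlier examples:
from selenium.webdriver.common.by import By

all_quotes = []
all_authors = []

# Assumes `driver` is an open WebDriver pointing at a quotes page
for quote in driver.find_elements(By.CLASS_NAME, "quote"):
    all_quotes.append(quote.find_element(By.CLASS_NAME, "text").text)
    all_authors.append(quote.find_element(By.CLASS_NAME, "author").text)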
Database
For larger datasets or more complex storage needs, you can use a database (like SQLite, MySQL, or PostgreSQL). You'll need to use a database connector library (like sqlite3 for SQLite) to interact with the database.
import sqlite3
# ... (your code to extract quotes and authors)
conn = sqlite3.connect('quotes.db')
cursor = conn.cursor()
cursor.execute('''CREATE TABLE IF NOT EXISTS quotes
(id INTEGER PRIMARY KEY AUTOINCREMENT, quote TEXT, author TEXT)''')
for quote, author in zip(all_quotes, all_authors):
    cursor.execute("INSERT INTO quotes (quote, author) VALUES (?, ?)", (quote, author))
conn.commit()
conn.close()
You can adapt this example to your specific needs and the format you prefer for storing the data you’ve scraped.
Wrapping up your first Selenium scraper
Congratulations! You've just completed a fully functional web scraping project with Selenium. By following this guide, you’ve learned how to extract, clean, and store data programmatically. Now that you’ve stored your data, your project is complete. You can further manipulate or analyze your data, or use your new tools to tackle similar challenges on other websites.
How to scrape images
Another website built for scraping is books.toscrape.com. This is a fictional bookstore with images of each book cover. You can use Selenium to scrape images from Books to Scrape and store them in a database, following these steps:
- Scrape image URLs: Use Selenium to navigate the site and extract image URLs
- Download images: Use the requests library to download images from the URLs
- Store images in a database: Use SQLite to store image data as binary blobs
First, ensure you have the necessary libraries installed (sqlite3 is part of Python's standard library, so it doesn't need to be installed separately):
pip install selenium requests
Code example
import requests
import sqlite3
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

# Initialize the WebDriver
driver = webdriver.Chrome()

# Connect to SQLite database (or create it)
conn = sqlite3.connect('book_images.db')
cursor = conn.cursor()

# Create a table to store images
cursor.execute('''
    CREATE TABLE IF NOT EXISTS images (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT,
        image BLOB
    )
''')

def download_image(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.content
    return None

def save_image_to_db(url, image_data):
    cursor.execute('INSERT INTO images (url, image) VALUES (?, ?)', (url, image_data))
    conn.commit()

try:
    # Navigate to the books page
    driver.get("http://books.toscrape.com/")

    while True:
        # Find all book elements on the current page
        books = driver.find_elements(By.CLASS_NAME, "product_pod")

        # Iterate over each book element and extract the image URL
        for book in books:
            img_element = book.find_element(By.TAG_NAME, "img")
            img_url = img_element.get_attribute('src')
            print(f"Downloading image: {img_url}")
            image_data = download_image(img_url)
            if image_data:
                save_image_to_db(img_url, image_data)

        try:
            # Find the "Next" button and click it to go to the next page
            next_button = driver.find_element(By.CLASS_NAME, "next")
            next_button.find_element(By.TAG_NAME, "a").click()
        except NoSuchElementException:
            # If there is no "Next" button, break the loop
            print("No more pages to navigate.")
            break
except Exception as e:
    print(f"Error occurred: {e}")
finally:
    # Ensure the driver is closed
    driver.quit()

# Close the database connection
conn.close()
Explanation
- WebDriver initialization: A Chrome WebDriver instance is created to control the browser
- Database setup: Connect to an SQLite database and create a table to store images if it doesn't already exist
- Image downloading: Use the requests library to download images from their URLs
- Storing images: Save the image data as binary blobs in the SQLite database
- Pagination handling: The script navigates through pages using the "Next" button until the last page is reached
- Error handling: Basic error handling is included to catch exceptions during the process
- Cleanup: The WebDriver and database connection are closed after the script completes
Considerations
- Database size: Storing images as blobs can increase the database size significantly. Consider storing image paths instead if you have a large number of images (see the sketch after this list).
- Network and performance: Downloading images can be network-intensive and slow, especially for large images or many images. Consider optimizing the download process if needed.
- Image format: Ensure that the images are in a format that can be stored and retrieved correctly from the database.
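As a rough sketch of the path-based alternative mentioned above, you could save each downloaded image to disk and store only its file path. This assumes a hypothetical images/ folder and a table with a path column instead of an image blob column:
import os

def save_image_to_disk(url, image_data, folder="images"):
    # Write the downloaded bytes to a local file and return its path
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, url.split('/')[-1])
    with open(path, 'wb') as f:
        f.write(image_data)
    return path

# Then store the path (TEXT) in the database instead of the blob, e.g.:
# cursor.execute('INSERT INTO images (url, path) VALUES (?, ?)', (url, path))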
How to access scraped images
To view the images stored in your SQLite database, you need to extract the binary data from the database and save it as image files on your local filesystem. Here's how you can do that:
import sqlite3
import os
# Connect to the SQLite database
conn = sqlite3.connect('book_images.db')
cursor = conn.cursor()
# Create a directory to save extracted images
os.makedirs('extracted_images', exist_ok=True)
# Query to select all images from the database
cursor.execute('SELECT id, url, image FROM images')
# Iterate over each image record
for record in cursor.fetchall():
    image_id, url, image_data = record

    # Extract the image name from the URL
    image_name = url.split('/')[-1]

    # Save the image to the local filesystem
    with open(f'extracted_images/{image_name}', 'wb') as img_file:
        img_file.write(image_data)

    print(f"Extracted {image_name}")
# Close the database connection
conn.close()
After running the above script, you will find the extracted images in the extracted_images directory. You can open these images using any image viewer on your computer.
Explanation
- Connect to the SQLite database where the images are stored
- Directory creation: Create a directory named extracted_images to store the extracted image files
- Execute a SQL query to select all image records from the database
- Iterate over each record, extracting the image ID, URL, and binary data
- Use the URL to determine the image file name
- Write the binary data to a file in the extracted_images directory
- Close the database connection after the extraction process is complete
Common web scraping challenges and error handling
As you start working on more web scraping projects, you'll likely encounter some common challenges. Here are a few of them, and how to solve them.
IP blocking
Websites often detect and block repeated requests from the same IP address, as this can indicate automated activity. To avoid this, you can use proxies to rotate your IP address and make your requests appear to come from different locations.
If you want your requests to appear as if they are coming from a real user (instead of your automated script) you should use residential proxies that use real residential IP addresses to mask your traffic.
If your proxies require authentication, you'll need to handle that separately, as Selenium's proxy setup doesn't directly support proxy authentication.
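As a minimal sketch (assuming Chrome and an unauthenticated proxy at a placeholder address), you can route the browser's traffic through a proxy with a Chrome command-line option:
from selenium import webdriver

options = webdriver.ChromeOptions()
# Placeholder address; replace with the host:port your proxy provider gives you
options.add_argument("--proxy-server=http://203.0.113.10:8080")

driver = webdriver.Chrome(options=options)
driver.get("https://httpbin.org/ip")  # shows the IP address the website sees
print(driver.page_source)
driver.quit()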
Timeouts and slow responses
Slow-loading pages can cause Selenium to look for elements before they exist. Use explicit waits (WebDriverWait) together with try-except blocks to handle cases where elements aren't found within the specified time:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Initialize the WebDriver
driver = webdriver.Chrome()
try:
    # Navigate to the desired page
    driver.get("https://example.com")

    # Wait for a specific element to appear
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "target_element_id"))
    )
    print("Element found:", element)
except Exception as e:
    print(f"Error occurred: {e}")
finally:
    # Ensure the driver is closed
    driver.quit()
Explanation of the code
This code uses Selenium to wait for a specific element to be present in the DOM of a web page.
- WebDriver Initialization: A Chrome WebDriver instance is created to control the browser.
- Page Navigation: The browser navigates to the specified URL.
- Waiting for Element: WebDriverWait is used to wait up to 10 seconds for the element with the specified ID to be present in the DOM.
- Exception Handling: If the element is not found within the timeout, an exception is caught, and an error message is printed.
- Driver Cleanup: The finally block ensures that the browser is closed after the script completes, regardless of whether an exception occurred.
Rate limiting
Many websites implement rate limits to restrict the number of requests you can make within a specific timeframe. This helps prevent their servers from being overloaded. To respect rate limits, you can add delays between your requests using time.sleep() to mimic human browsing behavior.
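For example, here's a small sketch that pauses for a couple of seconds between page loads (assuming driver is an open WebDriver from the earlier examples):
import time

for page in range(1, 6):
    driver.get(f"https://quotes.toscrape.com/page/{page}/")
    # ... extract data from the page here ...
    time.sleep(2)  # pause between requests to avoid overloading the server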
CAPTCHAs
CAPTCHAs are those distorted text images or puzzles that websites use to verify that a user is human. Solving CAPTCHAs can be tricky with Selenium. There are some libraries and services that can help automate CAPTCHA solving, but they might not always be reliable.
Dynamic content
Websites that heavily rely on JavaScript to load content can be more challenging to scrape with Selenium. Sometimes, the content you want to extract might not be immediately available in the page source, as it's loaded dynamically. You might need to use techniques like waiting for elements to load or combining Selenium with other libraries to handle dynamic content effectively.
Here’s how to wait for elements to load with Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://example.com/dynamic-content-page")
try:
    # Wait for a specific element to appear
    content = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "dynamic-content"))
    )
    print(content.text)  # Extract text from the dynamically loaded element
except Exception as e:
    print("Content not found:", e)

driver.quit()
Explanation of the code
- WebDriver Initialization: A Chrome WebDriver instance is created to control the browser.
- Page Navigation: The browser navigates to the specified URL.
- Waiting for Element: WebDriverWait is used to wait up to 10 seconds for the element with ID dynamic-content to be present in the DOM.
- Extracting Text: Once the element is found, its text content is printed.
- Exception Handling: If the element is not found within the timeout, an exception is caught, and an error message is printed.
- Driver Cleanup: The driver.quit() call ensures that the browser is closed after the script completes.
Next steps
Congratulations on completing your first web scraping project with Selenium! Now that you have a basic understanding of the fundamentals, here are some ideas for expanding your skills and taking on more ambitious projects:
- Dive deeper into Selenium's features. Learn how to handle cookies, user logins, and interact with more complex website elements.
- Explore how to use Selenium in conjunction with other libraries like BeautifulSoup for parsing HTML or Scrapy for building more advanced web scraping frameworks.
- Apply your web scraping skills to real-world problems. Extract data for market research, price comparison, or any other project that interests you. Remember to always respect website terms of service and scrape ethically.
Scraping with SOAX
Now that you’ve mastered the basics, it’s time to scale your scraper for larger projects. Consider integrating advanced tools like SOAX’s proxy services and scraping APIs to handle challenges such as IP bans, CAPTCHAs, and geographic restrictions. With SOAX, you can:
- Access a global pool of residential, mobile, ISP, and datacenter proxies for uninterrupted scraping
- Use smart scraper APIs to automate and simplify data extraction
- Reduce developer time with ready-to-use scraping solutions
Start experimenting by applying what you’ve learned to real-world projects. With the right tools and techniques, you’ll unlock data at scale while staying efficient and compliant.