Selenium is an open-source framework that allows you to control a browser programmatically. It’s a powerful tool for extracting data from websites, especially if you want to scrape a website that relies heavily on JavaScript or has a complex structure (e.g., websites with dynamic content, interactive elements, or nested HTML).
This step-by-step guide is designed for beginners who want to start scraping with Selenium in Python. This guide assumes you already have some experience using Python and understand basic HTML structure. After following the steps in this guide, you will be able to move on to more complex Selenium scraping projects.
What is Selenium and when should you use it?
Selenium is an open-source tool that automates web browsers. It's useful for a variety of tasks, including web testing, automating repetitive actions, and web scraping. In the context of web scraping, Selenium is particularly useful because it can handle dynamic content and interact with JavaScript elements like a human would.
Selenium’s ability to handle dynamic content and interact with JavaScript elements means you can use it to extract data from websites that constantly update (like social media feeds) or require user interaction, such as logging in or clicking buttons. For example, you could use Selenium to scrape product data from an ecommerce site, collect social media posts, or gather financial data from dynamic charts.
Read more: What is Selenium and how does it work?
Prerequisites for web scraping with Selenium
Before we get started, there are a few things you will need:
- Python (and a basic understanding of it)
- Selenium
- A web browser (e.g., Chrome)
- The specific web driver for your browser (e.g., ChromeDriver)
- Additional packages
Python
For this guide, it will be useful if you have a basic knowledge of Python, such as understanding variables, loops, functions, and data structures, as well as familiarity with Python packages and how to install them using pip.
You can download Python for free for macOS, Windows, or Linux from the Python website.
Selenium
Selenium can be installed using a pip command in your terminal:
pip install selenium
This will install the Selenium package in Python.
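If you want to confirm the installation worked, you can print the installed version from your terminal:
python -c "import selenium; print(selenium.__version__)"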
Web browser
You can use any web browser, but in this guide, we’ll use Chrome, which you can download from its official website.
You should also understand some browser basics, such as:
- HTML structure including tags, attributes, and the Document Object Model (DOM)
- A basic understanding of CSS selectors and XPath for locating elements
Web driver
You need to download a browser-specific web driver. For this project, we’ll be using Chrome, so you’ll need ChromeDriver, which you can download for free from the Chrome for Developers website.
Once you've downloaded ChromeDriver, you need to make it accessible to your Python scripts. There are two main ways to do this:
- Add it to your system's PATH: This allows you to run ChromeDriver from any location. Instructions for adding to your PATH vary depending on your operating system (Windows, macOS, or Linux).
- Place it in the same directory as your script: This is a simpler option, especially if you're just getting started. Download the ChromeDriver executable and place it in the same folder where you'll save your Python script.
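If you go with the second option, here is a minimal sketch of pointing Selenium at a local ChromeDriver executable. It assumes Selenium 4 and that the executable sits next to your script (on Windows the filename would be chromedriver.exe):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Assumes the ChromeDriver executable is in the same folder as this script
service = Service("./chromedriver")
browser = webdriver.Chrome(service=service)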
Alternatively, you can use the webdriver_manager package to automatically handle ChromeDriver installation and updates. First, install webdriver_manager using pip:
pip install webdriver-manager
Then, use this code in your script:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
Additional packages
We recommend installing the following packages:
pip install webdriver-manager # Simplifies webdriver management
pip install requests # Useful for making HTTP requests
pip install beautifulsoup4 # A library for parsing HTML and XML
If you’re unsure whether you have these packages installed already, you can check using the terminal by typing:
pip freeze
This will list all the packages you’ve installed and their version number.
How to inspect a web page
To extract data with Selenium, you first need to identify the HTML elements that contain the data you're interested in. This means you need to inspect the webpage to find relevant CSS selectors or XPath expressions.
Step 1: Open developer tools
- Right-click on the element you want to inspect and select Inspect (in most browsers like Chrome or Edge). This will open the developer tools panel and highlight the HTML code for the selected element.
- Alternatively, press Ctrl+Shift+I (Windows/Linux) or Cmd+Option+I (Mac) to open developer tools.
Step 2: Examine the HTML
Once the element is highlighted:
- Look at its tag name (e.g., <div>, <span>, <a>).
- Check for attributes like id, class, or name. These attributes are often used to locate elements.
Example: For a button with the following HTML:
<button class="btn-submit" id="submit-button">Submit</button>
You can identify it by:
- CSS Selector: .btn-submit (class) or #submit-button (id).
- XPath: //button[@id='submit-button'] or //button[contains(@class, 'btn-submit')].
Step 3: Copy CSS Selector or XPath
- Right-click on the element in the developer tools and select Copy → Copy selector or Copy XPath.
- Paste it into your Selenium code as needed.
Step 4: Use Selenium to locate elements
Once you have the CSS selector or XPath, use the following methods in Selenium to locate elements.
To locate elements by CSS selector (using the same example as before) you can use:
element = browser.find_element(By.CSS_SELECTOR, ".btn-submit")
To locate elements by XPath, you can use:
element = browser.find_element(By.XPATH, "//button[@id='submit-button']")
Example: Finding quotes on a page
For a webpage like Quotes to Scrape, each quote is inside a <span> with the class text. You can locate these elements by inspecting the HTML:
<span class="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
Then use the CSS selector .text or the XPath //span[@class='text'] to locate it in Selenium:
quotes = browser.find_elements(By.CSS_SELECTOR, ".text")
for quote in quotes:
    print(quote.text)
Tips for inspecting webpages
- As you hover over the HTML in developer tools, it will highlight the corresponding element on the page.
- Press Ctrl+F (Windows/Linux) or Cmd+F (Mac) in developer tools to search for tags, classes, or text.
- Paste your CSS selector or XPath into the search bar to verify it selects the correct elements.
How to scrape a website with Selenium
Now that you have Selenium and a webdriver set up, let’s dive into some basic Selenium concepts that you’ll need for web scraping.
Importing libraries
First, you need to import the necessary modules from the Selenium library. To work with the webdriver, add the following import statement to your script:
from selenium import webdriver
Creating a driver instance
Next, create an instance of the webdriver. This object will be your interface for controlling the web browser. Here’s how you create a ChromeDriver instance:
browser = webdriver.Chrome()
If you’re using webdriver_manager, you would instead use:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
Navigating to a web page
To open a specific web page in the automated browser, use the browser.get() method. For this example, we’ll be scraping Quotes to Scrape, a practice website built for scraping, so let’s navigate to its homepage.
browser.get("https://quotes.toscrape.com/")
This code will open the Quotes to Scrape website (https://quotes.toscrape.com/) in your automated browser window, ready for you to start extracting data.
Locating elements
To extract data from Quotes to Scrape, you need to locate the specific elements on the page that contain the information you're interested in. Selenium uses locators to find elements. Here are some common locators:
- By.ID: Locates an element by its unique id attribute.
- By.CLASS_NAME: Finds elements with a specific CSS class name.
- By.XPATH: Uses an XPath expression to locate an element based on its position in the HTML or its attributes.
- By.CSS_SELECTOR: Uses a CSS selector to find elements.
You can use these locators with the following methods:
- browser.find_element(locator, value): Finds a single element that matches the locator.
- browser.find_elements(locator, value): Finds all elements that match the locator. This returns a list of web elements.
On the Quotes to Scrape homepage, if you inspect the HTML, you'll notice that each quote is contained within a div element with the class quote. You can use this information to locate all the quote elements:
from selenium import webdriver
from selenium.webdriver.common.by import By

# Initialize the WebDriver
driver = webdriver.Chrome()

try:
    # Navigate to the quotes page
    driver.get("https://quotes.toscrape.com/")

    # Find all quote elements on the page
    quotes = driver.find_elements(By.CLASS_NAME, "quote")

    # ... rest of the code to extract data from quotes ...
except Exception as e:
    print(f"Error occurred: {e}")
finally:
    # Ensure the driver is closed
    driver.quit()
Explanation
- WebDriver initialization: A Chrome WebDriver instance is created to control the browser. Make sure you have the Chrome WebDriver installed and available in your system's PATH.
- Page navigation: The script navigates to the specified URL using driver.get().
- Finding elements: The script uses find_elements to locate all elements with the class name "quote". Each of these elements contains a quote and its author.
The find_elements() method (note the plural "elements") returns a list of all the elements that match the locator. In this case, it will give you a list of all the div elements with the class quote.
If you wanted to find a single element, you would use find_element() (singular).
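For example, this small sketch grabs only the first quote container on the page (assuming driver is the open browser from the previous snippet):
# find_element (singular) returns the first matching element only
first_quote_element = driver.find_element(By.CLASS_NAME, "quote")
print(first_quote_element.text)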
Extracting data
Now that you have a list of quote elements, you can extract the actual quote text and author information.
You can use element.text to get the text of an element. For example, to extract the text of the first quote, you can use:
first_quote = quotes[0].find_element(By.CSS_SELECTOR, "span.text").text
print(first_quote)
This code will find the span element with the class text within the first quote element and print its text content.
To extract the value of an attribute from an element, use element.get_attribute("attribute name") — an example follows below. The author's name, however, is ordinary text content, so to get the author of the first quote you can again use .text:
first_author = quotes[0].find_element(By.CLASS_NAME, "author").text
print(first_author)
This code will find the element with the class author within the first quote element and print its text content, which is the author’s name.
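And here is a small get_attribute() sketch: it reads the href of the first link inside the first quote block. On Quotes to Scrape that link is the author's "(about)" page, but verify this against the HTML of whatever page you're scraping:
# Assumes the first <a> inside the quote block is the "(about)" link
about_link = quotes[0].find_element(By.TAG_NAME, "a")
print(about_link.get_attribute("href"))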
By combining locators and these extraction methods, you can effectively scrape the quotes and author information from the Quotes to Scrape website.
Handle pagination (if needed)
Many websites (including Quotes to Scrape) split their content across multiple pages to improve loading speed and user experience. This is called pagination. If you want to scrape all the quotes from Quotes to Scrape – not just the ones on the first page – your Selenium scraper needs to be able to automatically move from one page to the next.
Here’s how to handle pagination with Selenium:
- Inspect the page to find the element that acts as the Next button or link. On Quotes to Scrape, it's an <a> tag (containing the text "Next") inside an <li> element with the class next.
- Use Selenium to click the Next button to navigate to the next page.
- Use a loop to repeat the process of locating quotes, extracting data, and clicking the Next button until you reach the end of the pages.
Here’s an example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

# Initialize the WebDriver
driver = webdriver.Chrome()

try:
    # Navigate to the quotes page
    driver.get("https://quotes.toscrape.com/")

    while True:
        # Find all quote elements on the current page
        quotes = driver.find_elements(By.CLASS_NAME, "quote")

        # Iterate over each quote element and extract the text and author
        for quote in quotes:
            text = quote.find_element(By.CLASS_NAME, "text").text
            author = quote.find_element(By.CLASS_NAME, "author").text
            print(f"Quote: {text}\nAuthor: {author}\n")

        try:
            # Find the "Next" link and click it to go to the next page
            next_button = driver.find_element(By.PARTIAL_LINK_TEXT, "Next")
            next_button.click()
        except NoSuchElementException:
            # If there is no "Next" link, break the loop
            break
except Exception as e:
    print(f"Error occurred: {e}")
finally:
    # Ensure the driver is closed
    driver.quit()
Explanation
The script uses a while True loop to continuously scrape quotes from each page. On each page, it finds all elements with the class name "quote" and extracts the text and author.
Pagination handling:
- The script looks for the Next link using find_element(By.PARTIAL_LINK_TEXT, "Next"). A partial match is used because the link text on the site also includes an arrow character.
- If the Next link is found, the script clicks it to navigate to the next page.
- If the Next link is not found (indicating the last page), a NoSuchElementException is raised, and the loop breaks.
Store the data
Once you've extracted the data, you'll want to store it in a structured format (like a database or CSV file) for later use.
CSV file
CSV (Comma Separated Values) is a simple and widely supported format. You can use Python's csv module to write data to a CSV file.
import csv
# ... (your code to extract quotes and authors)
with open('quotes.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Quote', 'Author'])  # Write header row
    for quote, author in zip(all_quotes, all_authors):  # Assuming you have lists of quotes and authors
        writer.writerow([quote, author])
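The snippet above assumes you already collected two parallel lists, all_quotes and all_authors. A minimal sketch of how you might build them while scraping, reusing the driver from the earlier examples:
from selenium.webdriver.common.by import By

all_quotes = []
all_authors = []

# Assumes `driver` is an open WebDriver pointing at a quotes page
for quote in driver.find_elements(By.CLASS_NAME, "quote"):
    all_quotes.append(quote.find_element(By.CLASS_NAME, "text").text)
    all_authors.append(quote.find_element(By.CLASS_NAME, "author").text)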
Database
For larger datasets or more complex storage needs, you can use a database (like SQLite, MySQL, or PostgreSQL). You'll need to use a database connector library (like sqlite3 for SQLite) to interact with the database.
import sqlite3
# ... (your code to extract quotes and authors)
conn = sqlite3.connect('quotes.db')
cursor = conn.cursor()
cursor.execute('''CREATE TABLE IF NOT EXISTS quotes
(id INTEGER PRIMARY KEY AUTOINCREMENT, quote TEXT, author TEXT)''')
for quote, author in zip(all_quotes, all_authors):
    cursor.execute("INSERT INTO quotes (quote, author) VALUES (?, ?)", (quote, author))
conn.commit()
conn.close()
You can adapt this example to your specific needs and the format you prefer for storing the data you’ve scraped.
Wrapping up your first Selenium scraper
Congratulations! You've just completed a fully functional web scraping project with Selenium. By following this guide, you’ve learned how to extract, clean, and store data programmatically. Now that you’ve stored your data, your project is complete. You can further manipulate or analyze your data, or use your new tools to tackle similar challenges on other websites.
How to scrape images
Another website built for scraping is books.toscrape.com. This is a fictional bookstore with images of each book cover. You can use Selenium to scrape images from Books to Scrape and store them in a database, following these steps:
- Scrape image URLs: Use Selenium to navigate the site and extract image URLs
- Download images: Use the requests library to download images from the URLs
- Store images in a database: Use SQLite to store image data as binary blobs
First, ensure you have the necessary libraries installed (sqlite3 is part of Python's standard library, so it doesn't need to be installed separately):
pip install selenium requests
Code example
import requests
import sqlite3
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

# Initialize the WebDriver
driver = webdriver.Chrome()

# Connect to SQLite database (or create it)
conn = sqlite3.connect('book_images.db')
cursor = conn.cursor()

# Create a table to store images
cursor.execute('''
    CREATE TABLE IF NOT EXISTS images (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT,
        image BLOB
    )
''')

def download_image(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.content
    return None

def save_image_to_db(url, image_data):
    cursor.execute('INSERT INTO images (url, image) VALUES (?, ?)', (url, image_data))
    conn.commit()

try:
    # Navigate to the books page
    driver.get("http://books.toscrape.com/")

    while True:
        # Find all book elements on the current page
        books = driver.find_elements(By.CLASS_NAME, "product_pod")

        # Iterate over each book element and extract the image URL
        for book in books:
            img_element = book.find_element(By.TAG_NAME, "img")
            img_url = img_element.get_attribute('src')
            print(f"Downloading image: {img_url}")
            image_data = download_image(img_url)
            if image_data:
                save_image_to_db(img_url, image_data)

        try:
            # Find the "Next" button and click it to go to the next page
            next_button = driver.find_element(By.CLASS_NAME, "next")
            next_button.find_element(By.TAG_NAME, "a").click()
        except NoSuchElementException:
            # If there is no "Next" button, break the loop
            print("No more pages to navigate.")
            break
except Exception as e:
    print(f"Error occurred: {e}")
finally:
    # Ensure the driver is closed
    driver.quit()

# Close the database connection
conn.close()
Explanation
- WebDriver initialization: A Chrome WebDriver instance is created to control the browser
- Database setup: Connect to an SQLite database and create a table to store images if it doesn't already exist
- Image downloading: Use the requests library to download images from their URLs
- Storing images: Save the image data as binary blobs in the SQLite database
- Pagination handling: The script navigates through pages using the "Next" button until the last page is reached
- Error handling: Basic error handling is included to catch exceptions during the process
- Cleanup: The WebDriver and database connection are closed after the script completes
Considerations
- Database size: Storing images as blobs can increase the database size significantly. Consider storing image paths instead if you have a large number of images (see the sketch after this list).
- Network and performance: Downloading images can be network-intensive and slow, especially for large images or many images. Consider optimizing the download process if needed.
- Image format: Ensure that the images are in a format that can be stored and retrieved correctly from the database.
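As a rough sketch of the path-based alternative mentioned above, you could save each downloaded image to disk and store only its file path. This assumes a hypothetical images/ folder and a table with a path column instead of an image blob column:
import os

def save_image_to_disk(url, image_data, folder="images"):
    # Write the downloaded bytes to a local file and return its path
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, url.split('/')[-1])
    with open(path, 'wb') as f:
        f.write(image_data)
    return path

# Then store the path (TEXT) in the database instead of the blob, e.g.:
# cursor.execute('INSERT INTO images (url, path) VALUES (?, ?)', (url, path))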
How to access scraped images
To view the images stored in your SQLite database, you need to extract the binary data from the database and save it as image files on your local filesystem. Here's how you can do that:
import sqlite3
import os
# Connect to the SQLite database
conn = sqlite3.connect('book_images.db')
cursor = conn.cursor()
# Create a directory to save extracted images
os.makedirs('extracted_images', exist_ok=True)
# Query to select all images from the database
cursor.execute('SELECT id, url, image FROM images')
# Iterate over each image record
for record in cursor.fetchall():
    image_id, url, image_data = record

    # Extract the image name from the URL
    image_name = url.split('/')[-1]

    # Save the image to the local filesystem
    with open(f'extracted_images/{image_name}', 'wb') as img_file:
        img_file.write(image_data)

    print(f"Extracted {image_name}")
# Close the database connection
conn.close()
After running the above script, you will find the extracted images in the extracted_images directory. You can open these images using any image viewer on your computer.
Explanation
- Connect to the SQLite database where the images are stored
- Directory creation: Create a directory named extracted_images to store the extracted image files
- Execute a SQL query to select all image records from the database
- Iterate over each record, extracting the image ID, URL, and binary data
- Use the URL to determine the image file name
- Write the binary data to a file in the extracted_images directory
- Close the database connection after the extraction process is complete
Common web scraping challenges and error handling
As you start working on more web scraping projects, you'll likely encounter some common challenges. Here are a few of them, and how to solve them.
IP blocking
Websites often detect and block repeated requests from the same IP address, as this can indicate automated activity. To avoid this, you can use proxies to rotate your IP address and make your requests appear to come from different locations.
If you want your requests to appear as if they are coming from a real user (instead of your automated script) you should use residential proxies that use real residential IP addresses to mask your traffic.
If your proxies require authentication, you'll need to handle that separately, as Selenium's proxy setup doesn't directly support proxy authentication.
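As a minimal sketch (assuming Chrome and an unauthenticated proxy at a placeholder address), you can route the browser's traffic through a proxy with a Chrome command-line option:
from selenium import webdriver

options = webdriver.ChromeOptions()
# Placeholder address; replace with the host:port your proxy provider gives you
options.add_argument("--proxy-server=http://203.0.113.10:8080")

driver = webdriver.Chrome(options=options)
driver.get("https://httpbin.org/ip")  # shows the IP address the website sees
print(driver.page_source)
driver.quit()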
Timeouts and slow responses
Slow-loading pages can cause Selenium to look for elements before they exist. Use explicit waits (WebDriverWait) together with try-except blocks to handle cases where elements aren't found within the specified time:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Initialize the WebDriver
driver = webdriver.Chrome()
try:
    # Navigate to the desired page
    driver.get("https://example.com")

    # Wait for a specific element to appear
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "target_element_id"))
    )
    print("Element found:", element)
except Exception as e:
    print(f"Error occurred: {e}")
finally:
    # Ensure the driver is closed
    driver.quit()
Explanation of the code
This code uses Selenium to wait for a specific element to be present in the DOM of a web page.
- WebDriver Initialization: A Chrome WebDriver instance is created to control the browser.
- Page Navigation: The browser navigates to the specified URL.
- Waiting for Element: WebDriverWait is used to wait up to 10 seconds for the element with the specified ID to be present in the DOM.
- Exception Handling: If the element is not found within the timeout, an exception is caught, and an error message is printed.
- Driver Cleanup: The finally block ensures that the browser is closed after the script completes, regardless of whether an exception occurred.
Rate limiting
Many websites implement rate limits to restrict the number of requests you can make within a specific timeframe. This helps prevent their servers from being overloaded. To respect rate limits, you can add delays between your requests using time.sleep() to mimic human browsing behavior.
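For example, here's a small sketch that pauses for a couple of seconds between page loads (assuming driver is an open WebDriver from the earlier examples):
import time

for page in range(1, 6):
    driver.get(f"https://quotes.toscrape.com/page/{page}/")
    # ... extract data from the page here ...
    time.sleep(2)  # pause between requests to avoid overloading the server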
CAPTCHAs
CAPTCHAs are those distorted text images or puzzles that websites use to verify that a user is human. Solving CAPTCHAs can be tricky with Selenium. There are some libraries and services that can help automate CAPTCHA solving, but they might not always be reliable.
Dynamic content
Websites that heavily rely on JavaScript to load content can be more challenging to scrape with Selenium. Sometimes, the content you want to extract might not be immediately available in the page source, as it's loaded dynamically. You might need to use techniques like waiting for elements to load or combining Selenium with other libraries to handle dynamic content effectively.
Here’s how to wait for elements to load with Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://example.com/dynamic-content-page")
try:
    # Wait for a specific element to appear
    content = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "dynamic-content"))
    )
    print(content.text)  # Extract text from the dynamically loaded element
except Exception as e:
    print("Content not found:", e)

driver.quit()
Explanation of the code
- WebDriver Initialization: A Chrome WebDriver instance is created to control the browser.
- Page Navigation: The browser navigates to the specified URL.
- Waiting for Element: WebDriverWait is used to wait up to 10 seconds for the element with ID dynamic-content to be present in the DOM.
- Extracting Text: Once the element is found, its text content is printed.
- Exception Handling: If the element is not found within the timeout, an exception is caught, and an error message is printed.
- Driver Cleanup: The driver.quit() call ensures that the browser is closed after the script completes.
Next steps
Congratulations on completing your first web scraping project with Selenium! Now that you have a basic understanding of the fundamentals, here are some ideas for expanding your skills and taking on more ambitious projects:
- Dive deeper into Selenium's features. Learn how to handle cookies, user logins, and interact with more complex website elements.
- Explore how to use Selenium in conjunction with other libraries like BeautifulSoup for parsing HTML or Scrapy for building more advanced web scraping frameworks.
- Apply your web scraping skills to real-world problems. Extract data for market research, price comparison, or any other project that interests you. Remember to always respect website terms of service and scrape ethically.
Scraping with SOAX
Now that you’ve mastered the basics, it’s time to scale your scraper for larger projects. Consider integrating advanced tools like SOAX’s proxy services and scraping APIs to handle challenges such as IP bans, CAPTCHAs, and geographic restrictions. With SOAX, you can:
- Access a global pool of residential, mobile, ISP, and datacenter proxies for uninterrupted scraping
- Use smart scraper APIs to automate and simplify data extraction
- Reduce developer time with ready-to-use scraping solutions
Start experimenting by applying what you’ve learned to real-world projects. With the right tools and techniques, you’ll unlock data at scale while staying efficient and compliant.