How to scrape a website using Python: Step-by-step guide for beginners

Written by: Lisa Whelan

Web scraping is a powerful way to extract information from websites. It automates data collection, saving you from tedious manual work (like copying and pasting the data you need). If you're new to web scraping, we recommend starting with a programming language called Python, because it is relatively simple and has purpose-built libraries that make it easy to scrape data from both static and dynamic websites.

This guide shows you how to use Python to scrape data from two websites: a static bookstore with over 1,000 titles, and a dynamic quotes site that changes its content based on how people interact with it. You’ll learn how to:

  • Navigate website structures
  • Retrieve HTML content
  • Extract the data you need
  • Store the data in an easy-to-use format (like CSV)

By the end of this article, you’ll be able to start your own web scraping projects!

What is web scraping?

Web scraping is the automated extraction of data from websites. It uses web scrapers to navigate web pages, retrieve the desired information, and store it in a structured format, such as a CSV file or a database. Web scraping is widely used for tasks like market research, data analysis, and competitive intelligence. By automating the data collection process, you can gather large amounts of information quickly and efficiently, which can be a vital process for anyone needing to make data-informed decisions.

Why use Python for web scraping?

Python is a popular choice of programming language for web scraping because it has a rich set of tools and web scraping libraries. Python's clean and readable syntax makes it easy to understand and write code, even for beginners.

Python also has an extensive ecosystem that includes libraries like BeautifulSoup, Scrapy, and Selenium, which are specifically designed for web scraping tasks. These web scraping libraries provide powerful tools for navigating web pages, extracting data, and handling dynamic content.

Additionally, Python's large and active community means you can find plenty of resources, tutorials, and support to help you with your web scraping projects. Whether you're a novice or an experienced developer, Python is versatile and easy to use, making it an excellent choice for effective web scraping.

Static vs dynamic websites

Before we start scraping, it’s important to know if the web page you want to scrape is static or dynamic. This will help you to choose the right tools for the job.

Static websites present the same information to every user, remaining unchanged until a developer modifies the source code. Think of a static bookstore—every visitor sees the same collection of titles, prices, and ratings.

On the other hand, dynamic websites offer a tailored experience, displaying different content based on your interactions. Social media platforms like Instagram, X (formerly Twitter), and Facebook are good examples of dynamic websites; the content you see changes depending on your unique account and preferences, which creates a personalized browsing experience.

Static websites

Static websites show the same content to everyone. They don't change unless someone updates the website's code. It's like a library where the books and their information always stay the same.

The fictional bookstore we use as an example in this guide is a static website. The website has listings for more than 1000 books. Each book has several fields, including:

  • Title
  • Price
  • Rating
  • Image
  • URL that redirects you to a page with more information on the specific book

The products page on books.toscrape.com, showing book cover images with titles, prices, and "Add to basket" buttons.

Because it's a static website, all of this content appears in the same way for everyone who visits the website.

Dynamic websites

Dynamic websites use JavaScript to load content dynamically, presenting a unique challenge for web scraping. Unlike static websites, where the content is readily available in the initial HTML response, dynamic websites generate content based on how you interact with it. This makes traditional web scraping methods less effective.

Luckily, Python libraries such as Selenium and Scrapy provide tools for handling dynamic websites. Selenium, for instance, can automate web browsers to interact with dynamic content, while Scrapy offers advanced features for extracting data from complex web pages. By leveraging these libraries, you can effectively scrape data from dynamic websites, ensuring you capture all the information you need.

Dynamic websites show different things to different people. The content changes based on what you do on the site. Think of Amazon, where you see product recommendations based on your past purchases.

In our example, we’ll scrape a quotes website that displays quotes dynamically. Each quote has:

  • An author
  • The quote
  • Some tags


Web scraping with Python: Step-by-step overview

Web scraping can generally be broken down into four steps:

  1. Inspecting the website
  2. Making an HTTP request
  3. Processing the HTML
  4. Storing the data

Step 1: Inspect the website

The first step is to inspect the website, so you can understand its structure and the location of your desired element within the HTML. For example, on the static bookstore website, the image is above the rating and the rating is above the book title.


Step 2: Make an HTTP request

Once you understand the structure, you can make an HTTP request to the page that contains the element you want to scrape and pull down all of its HTML content.

Step 3: Process the HTML

Now you can use an HTML parser to process the HTML you retrieved and extract the data you need from it.

Step 4: Store the data

As the last step, you need to store the extracted data in a database (or whatever format best suits your needs). In this example, we will use a CSV file.
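Put together, those four steps take only a few lines of Python. Here's a minimal preview of what you'll build in this guide, using the bookstore site as the target (it relies on the requests and BeautifulSoup libraries, which you'll install in the next section, and the titles.csv file name is just an example):

import csv
import requests
from bs4 import BeautifulSoup

# Step 1 (inspecting the site) happens in your browser's DevTools

# Step 2: make an HTTP request for the page's HTML
response = requests.get('https://books.toscrape.com')

# Step 3: process the HTML and extract the book titles
soup = BeautifulSoup(response.text, 'html.parser')
titles = [book.find('h3').find('a')['title']
          for book in soup.find_all('article', {'class': 'product_pod'})]

# Step 4: store the data in a CSV file
with open('titles.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Title'])
    writer.writerows([title] for title in titles)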

Prerequisites

In this guide, we will assume you don’t know anything about Python or web scraping, and will cover every step from scratch. However, if you have some basic understanding of Python, you will find scraping much easier.

Install Python

If you don’t have Python installed already, follow the official documentation to install Python on your computer. This guide was made using Python 3.12 and pip version 23.2.1.
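If you're not sure which versions you have, you can check them from your terminal:

python --version
pip --version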

Python libraries

You will use three Python web scraping libraries to help you:

  • Requests: This library makes HTTP requests to the websites and pulls the HTML content. It is generally used for static websites.
  • Selenium: This is a browser automation tool that is used to scrape dynamic websites and automate website interactions.
  • BeautifulSoup: This library will be used to process and extract the data you want from the HTML.

To install these three libraries using pip, run the following command:

pip install beautifulsoup4 requests selenium
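To confirm that the libraries installed correctly, you can print their versions with a one-line check:

python -c "import requests, bs4, selenium; print(requests.__version__, bs4.__version__, selenium.__version__)"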

The final project folder structure will be:

|Project_Folder
   |_ scrape_bookstore.py
   |_ scrape_quotes.py

Setting up a web scraping project

Setting up a web scraping project involves several steps. Here are some important considerations to keep in mind:

  • Choose the right tools and libraries: Python offers a wide range of libraries and frameworks for web scraping, including BeautifulSoup, Scrapy, and Selenium. Each library has its strengths, so choose the one that best fits your project’s needs. BeautifulSoup is great for parsing HTML and XML documents, Scrapy is a powerful framework for large-scale scraping, and Selenium is ideal for interacting with dynamic content.
  • Select the target website: Choose a website that contains the data you need and is suitable for web scraping. It’s crucial to check the website’s terms of service and robots.txt file to ensure that web scraping is allowed. Respecting these guidelines helps you avoid legal issues and ensures ethical scraping practices.
  • Design the scraping process: Plan out the steps involved in your scraping process. This includes identifying the data you want to extract, understanding the website’s structure, and handling potential errors. Consider how you will store the scraped data, whether in a CSV file, a database, or another format. Proper planning helps you create a robust and efficient scraping script.

By following these steps and considering these key factors, you can set up a successful web scraping project and start extracting data from websites. Whether you're gathering data for research, analysis, or business intelligence, a well-planned web scraping project can provide valuable insights and save you time and effort.
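For the robots.txt check mentioned above, Python's standard library includes urllib.robotparser, so you don't need anything extra. A minimal sketch, using the bookstore site from this guide as the target:

from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt
robots = RobotFileParser('https://books.toscrape.com/robots.txt')
robots.read()

# Check whether a generic crawler is allowed to fetch a catalogue page
print(robots.can_fetch('*', 'https://books.toscrape.com/catalogue/page-1.html'))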

How to scrape a static website with Python

Open your browser and load the fictional bookstore website: https://books.toscrape.com. This guide will help you build a web scraper to extract data from this site.

To open the developer tools, use the shortcut Ctrl + Shift + i (Windows) or Command + Option + i (macOS). This will open a panel where you can access the website's code and structure.

The Books to Scrape website with the HTML open on the right-hand side and the Elements tab highlighted.

Note: Make sure that you’re on the Elements tab.

Inspecting the static website

Right-click on the first book and, in the drop-down menu, click Inspect. The DevTools panel will open with that specific element's code highlighted.


On reading the HTML, you'll notice that the book and all its contents are contained in an article tag with the class product_pod.

Fetching the HTML for the static website 

Before you can make use of the product_pod class, you need to get the HTML content using the requests library. In scrape_bookstore.py, add this code:

import requests

# Base URL
base_url = 'https://books.toscrape.com'

# Send a GET request to the base URL
response = requests.get(base_url)

# Get the HTML content
html_content = response.text

# Print the HTML content
print(html_content)

This code imports the library, navigates to the base URL, and retrieves the full HTML code.

Run the above script using the command:

python scrape_bookstore.py

You will see the HTML printed out on your terminal. It will look something like this:

A screenshot of the web page's HTML after extraction.

Extracting data from the HTML for the static website 

Now you need to extract the data for all the books listed on the website. To do that, you need to use the BeautifulSoup library you installed earlier. BeautifulSoup will help you locate data in a few ways: by an element's ID, its class, or a CSS selector. XPath is another common way to address elements, but BeautifulSoup doesn't evaluate XPath itself; for that you'd use a library like lxml, covered in the advanced techniques section. There's a short example after the list below.

Here's what each of them means:

  • ID: A unique identifier for an element, allowing direct access to it in the DOM
  • Class: Shared identifier used for various HTML elements. Allows developers to group similar elements together
  • XPath: Query language used to select elements and navigate complex documents
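To make that concrete, here's a small, self-contained example of locating an element by its class and by a CSS selector with BeautifulSoup (the HTML snippet is a made-up stand-in for one book listing):

from bs4 import BeautifulSoup

html = '''
<article class="product_pod">
  <h3><a href="catalogue/sample-book_1/index.html" title="Sample Book">Sample Book</a></h3>
  <p class="price_color">£20.00</p>
</article>
'''

soup = BeautifulSoup(html, 'html.parser')

# Locate by class, the approach used throughout this guide
book = soup.find('article', {'class': 'product_pod'})
print(book.find('a')['title'])  # Sample Book

# The same data located with a CSS selector
print(soup.select_one('article.product_pod p.price_color').text)  # £20.00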

When you inspected the HTML, you identified that the information about each book is located within an article tag with the class product_pod. Return to the DevTools tab and you will notice the following:

  • The title can be found in the title attribute of a link tag, which is contained inside an h3 tag
  • The price is inside a paragraph tag with the class price_color, nested inside a div with the class product_price
  • The book's URL is in the href attribute of the same link tag that contains the title of the book
  • The image URL is in a div tag that has the class image_container. Inside this div, there's an anchor (<a>) tag that wraps an image tag (<img>), and the src attribute of this image tag provides the relative link to the image
  • When it comes to star ratings, the structure is particularly interesting. You can locate the star ratings by following this hierarchy: product_pod > star-rating, where the number of stars is indicated by the class name. For instance, a product with a three-star rating is represented as product_pod > star-rating Three


If you only want five-star-rated books, you can extract them by adjusting the code in scrape_bookstore.py to:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Base URL
base_url = 'https://books.toscrape.com'

response = requests.get(base_url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all books with a 5-star rating
for book in soup.find_all('article', {'class': 'product_pod'}):
    if book.find('p', {'class': 'star-rating Five'}):
        try:
            title = book.find('h3').find('a')['title']
            price = book.find('p', {'class': 'price_color'}).text[1:]
            # urljoin turns the relative href/src values into absolute URLs
            book_url = urljoin(base_url, book.find('h3').find('a')['href'])
            image_url = urljoin(base_url, book.find('img')['src'])

            print(title)
            print(price)
            print(book_url)
            print(image_url)

        except Exception as e:
            print(f"Error processing data for book '{title}': {e}")

When you run this code, you will notice the script only retrieves the first page of the website. It begins by importing the necessary libraries and then fetches the HTML content using the requests.get(base_url) method and parses it with BeautifulSoup. 

The script iterates through each book listing found in the article tags with the class product_pod, checking for a five-star rating using if book.find('p', {'class': 'star-rating Five'})

If a five-star book is identified, it extracts essential information such as the title from the anchor tag within the h3 tag, the price using text[1:] to remove the currency symbol, and constructs the full book and image URLs.

Pagination when scraping static websites

As you scroll down the website, you'll encounter a Next button. Clicking it triggers two actions: 

  • the URL updates 
  • additional books are displayed

This behavior also applies when navigating to subsequent pages.


The URL structure allows for looping through all available pages. Since the website contains 50 pages, you can modify your code to scrape information from each page. The updated code becomes:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Establishing our Base URL
base_url = 'https://books.toscrape.com'

# Looping through all the pages
for i in range(1, 51):  # The website has 50 pages
    url = f"{base_url}/catalogue/page-{i}.html"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Hunting for all books with a 5-star rating
    for book in soup.find_all('article', {'class': 'product_pod'}):
        if book.find('p', {'class': 'star-rating Five'}):
            try:
                title = book.find('h3').find('a')['title']
                price = book.find('p', {'class': 'price_color'}).text[1:]
                # urljoin resolves the relative links against the current page URL
                book_url = urljoin(url, book.find('h3').find('a')['href'])
                image_url = urljoin(url, book.find('img')['src'])

                print(title)
                print(price)
                print(book_url)
                print(image_url)

            except Exception as e:
                print(f"Error processing data for book '{title}': {e}")

The output on your terminal should look something like this:


This code enables you to loop through all 50 pages of the bookstore website, extracting essential details about books with a five-star rating. For each book, it retrieves the title, price, URL, and image URL, printing them to the console for easy access. Error handling ensures that any issues encountered during data extraction are reported. The page URL is built with an f-string, inserting the page number from the loop variable on each iteration.

Storing scraped data in a CSV file

So far, you’ve printed the results of your web scraping project to the terminal. If you want to use this data for further analysis in other projects, you can store it in a CSV, JSON, or a database. 

Python's built-in CSV module allows you to save the data as a CSV file, which can be named descriptively to reflect its content—like 5_star_books.csv for this dataset. 

By writing to the CSV file each time you find a book that meets the five-star rating criteria, you can preserve your results in case the script encounters an error later, saving time and resources when you rerun it.

The updated code is as follows:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import csv

# Base URL
base_url = 'https://books.toscrape.com'

# Initialize CSV file
with open('5_star_books.csv', 'w', newline='') as csvfile:
    fieldnames = ['Title', 'Price', 'URL', 'Image URL', 'Web URL']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    # Loop through all the pages
    for i in range(1, 51):  # The website has 50 pages
        url = f"{base_url}/catalogue/page-{i}.html"

        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all books with 5-star rating
        for book in soup.find_all('article', {'class': 'product_pod'}):
            if book.find('p', {'class': 'star-rating Five'}):

                try:
                    title = book.find('h3').find('a')['title']
                    price = book.find('p', {'class': 'price_color'}).text[1:]
                    # urljoin resolves the relative links against the current page URL
                    book_url = urljoin(url, book.find('h3').find('a')['href'])
                    image_url = urljoin(url, book.find('img')['src'])

                    # Write to CSV
                    writer.writerow({
                        'Title': title,
                        'Price': price,
                        'URL': book_url,
                        'Image URL': image_url,
                        'Web URL': url
                    })

                except Exception as e:
                    print(f"Error processing data for book '{title}': {e}")

This code snippet initializes a CSV file for storing the scraped data, writes headers for the columns, and loops through all 50 pages of the bookstore. For each five-star book found, it extracts the title, price, URL, and image URL, and appends these details into the CSV file. The final CSV file will be located in your project folder alongside your Python scripts, ensuring easy access to the collected data, structured as follows:

|Project_Folder
   |_ scrape_bookstore.py
   |_ scrape_quotes.py
   |_ 5_star_books.csv	

The resulting CSV file will look like this:

A spreadsheet with the headings Title, Price, URL, Image URL, and Web URL, with the book details beneath.
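To double-check the results, you can read the file back with the same csv module (a quick sanity check, assuming the file name used above):

import csv

with open('5_star_books.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    rows = list(reader)

print(f"Found {len(rows)} five-star books")
print(rows[0]['Title'], rows[0]['Price'])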

How to scrape a dynamic website with Python

Dynamic websites pose unique challenges because their content is continually changing. These changes are often generated by a programming language like JavaScript.

Inspecting the dynamic website

To inspect a dynamic website, navigate to the quotes website and open Developer Tools in your browser. Use the following shortcuts:

  • Ctrl + shift + i (Windows)
  • Command + Option + i (macOS)

This will open a panel, typically docked at the bottom or side of your browser. Then you can select the Elements tab.

The Quotes to Scrape website with the HTML exposed and the Elements tab highlighted.

Fetching the HTML

To fetch the HTML from the dynamic website, you'll need Selenium, which requires a compatible web driver (e.g., ChromeDriver). Ensure you've downloaded the appropriate driver for your browser and know its executable path (the location on your computer where the driver file is stored). Recent Selenium releases (4.6 and later) can also download a matching driver for you automatically, but it's still useful to know where your driver lives.

You can use any browser you want, although Chrome is recommended thanks to its extensive documentation and community support.

Selenium works with several drivers. You can find the full list on the Selenium website. At the time of writing, Selenium supports:

  • Google Chrome
  • Firefox
  • Microsoft Edge
  • Internet Explorer
  • Safari
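The setup code you'll write below follows the same pattern for each of these browsers. For example, a Firefox version would look like this (a sketch, assuming you've downloaded geckodriver and know its path):

from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Point Selenium at your geckodriver instead of chromedriver
driver = webdriver.Firefox(service=Service('/path/to/geckodriver'))
driver.get("https://quotes.toscrape.com/js/")
print(driver.title)
driver.quit()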

In the scrape_quotes.py file you created earlier, add the following code:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Set up the webdriver.
# In this example, we're using the Chrome driver.
# Replace the path with the location of your chromedriver (or whichever driver you use)
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))

# Navigate to the website  
driver.get("https://quotes.toscrape.com/js/")  

# Get the HTML source  
html_source = driver.page_source  
print("HTML source of the website:", html_source)  

# Close the browser to free up resources 
driver.close()

This code:

  • Initializes Selenium with the ChromeDriver
  • Navigates to the specified URL
  • Retrieves the page's HTML
  • Prints the HTML to the terminal
  • Closes the browser to free resources. 

Once you run the code above, it will print the HTML of the webpage on your terminal.

The HTML of the dynamic website, printed with indentation.

Extracting data from HTML for the dynamic website 

Next, you need to identify specific elements that contain the data you want. Upon inspection, you'll find:

  • You can find the quote in a <span> with the class text
  • You can find the author within a <small> tag with the class author
  • The tags are in a <div> with the class tags, each represented by an <a> tag

You can then adjust your code to use BeautifulSoup to extract the above information as follows:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

# Set up the webdriver.
# In this example, we're using the Chrome driver.
# Replace the path with the location of your chromedriver
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))

# Navigate to the website  
driver.get("https://quotes.toscrape.com/js/")  

# Get the HTML source  
html_source = driver.page_source  

# Close the browser to free up resources 
driver.close()

soup = BeautifulSoup(html_source, 'html.parser') 

# Find all quotes
for quote in soup.find_all('div', {'class': 'quote'}):
    try:
        quote_text = quote.find('span', {'class': 'text'}).text[1:]
        quote_author = quote.find('small', {'class': 'author'}).text

        # Find all tags within the quote
        tags = quote.find_all('a', {'class': 'tag'})
        tag_list = [tag.text for tag in tags]

        print(f"Quote: {quote_text}")
        print(f"Author: {quote_author}")
        print(f"Tags: {tag_list}")

    except Exception as e:
        print(f"Error processing data for quote '{quote_text}': {e}")

The above code will return the following on your terminal:


This Python script uses Selenium and BeautifulSoup to scrape quotes from a dynamically loaded webpage. It imports the necessary libraries and sets up the Chrome WebDriver, specifying the path to chromedriver. The script then navigates to the quotes website and captures the complete HTML source after the page loads.

Once the script closes the browser to free up resources, it uses BeautifulSoup to parse the HTML. The script searches for all <div> elements with the class quote, iterating through them to extract the quote text, author, and associated tags.

The script prints and formats each piece of data. It then implements error handling to catch and report any issues encountered during data extraction, ensuring robust performance even in the face of unexpected HTML structure changes. This combination of tools automates the extraction of structured data from dynamic web content.

Pagination when scraping dynamic websites

When navigating the quotes website, you may not know the total number of pages. To handle pagination, instruct Selenium to click the Next button until it reaches the last page. 

The first thing you need to identify is the class of the Next button. You can see it sits inside a list item (<li>) tag with the class next. Using this information, you can make Selenium click the Next button if it exists.

You can create a function to streamline the quote extraction:

def find_quotes_on_page(soup):
    for quote in soup.find_all('div', {'class': 'quote'}):
        try:
            quote_text = quote.find('span', {'class': 'text'}).text[1:]
            quote_author = quote.find('small', {'class': 'author'}).text

            # Find all tags within the quote
            tags = quote.find_all('a', {'class': 'tag'})
            tag_list = [tag.text for tag in tags]

            print(f"Quote: {quote_text}")
            print(f"Author: {quote_author}")
            print(f"Tags: {tag_list}")

        except Exception as e:
            print(f"Error processing quote: {e}")

This function extracts quotes from the provided soup object, handling errors while printing out the details.

To check if the Next button exists, use this function:

def check_if_next(soup):
   next_button = soup.find('li', {'class': 'next'})
   return bool(next_button)

You also need a function that clicks the actual Next button if it exists:

def navigation(driver):
   try:
       # Get the HTML source
       html_source = driver.page_source
   except Exception as e:
       print(f"Error retrieving HTML source: {e}")
       return None, driver

   if html_source:
       # Parse the HTML with BeautifulSoup
       soup = BeautifulSoup(html_source, 'html.parser')
       return soup, driver
   else:
       return None, driver


def click_next(driver):
   next_button = driver.find_element(By.CSS_SELECTOR, "li.next > a")  # Using anchor tag within the 'li'
   next_button.click()

   # Optional: Wait for the page to reload before proceeding (adjust timeout as needed)
   time.sleep(2)
   # Get the new HTML and parse it
   soup, driver = navigation(driver)
   return soup, driver

The click_next function searches for the next button (identified by the CSS selector li.next > a), clicks it to navigate to the subsequent page, and pauses to allow the new page to load before calling the navigation function again to get the updated HTML.

Lastly, you can tie everything together in a while loop that checks if the Next button exists. If it doesn't, the webdriver exits the loop:

from selenium import webdriver
from selenium.webdriver.common.by import By  # Import By class
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time  # For introducing delays (optional)

# Set up the webdriver (replace with your chromedriver path)
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))


# Set the initial URL
start_url = "https://quotes.toscrape.com/js/"


# Navigate to the website
driver.get(start_url)


def navigation(driver):
   try:
       # Get the HTML source
       html_source = driver.page_source
   except Exception as e:
       print(f"Error retrieving HTML source: {e}")
       return None, driver


   if html_source:
       # Parse the HTML with BeautifulSoup
       soup = BeautifulSoup(html_source, 'html.parser')
       return soup, driver
   else:
       return None, driver


def click_next(driver):
   next_button = driver.find_element(By.CSS_SELECTOR, "li.next > a")  # Using anchor tag within the 'li'
   next_button.click()


   # Optional: Wait for the page to reload before proceeding (adjust timeout as needed)
   time.sleep(2)
   # Get the new HTML and parse it
   soup, driver = navigation(driver)
   return soup, driver


def check_if_next(soup):
   next_button = soup.find('li', {'class': 'next'})
   return bool(next_button)


def find_quotes_on_page(soup):
    for quote in soup.find_all('div', {'class': 'quote'}):
        try:
            quote_text = quote.find('span', {'class': 'text'}).text[1:]
            quote_author = quote.find('small', {'class': 'author'}).text

            # Find all tags within the quote
            tags = quote.find_all('a', {'class': 'tag'})
            tag_list = [tag.text for tag in tags]

            print(f"Quote: {quote_text}")
            print(f"Author: {quote_author}")
            print(f"Tags: {tag_list}")

        except Exception as e:
            print(f"Error processing quote: {e}")


# Initial navigation and quote extraction
soup, driver = navigation(driver)
find_quotes_on_page(soup)


# Loop through subsequent pages (if any)
while check_if_next(soup):
   soup, driver = click_next(driver)
   find_quotes_on_page(soup)


# Close the browser
driver.close()

The main script sets up the WebDriver and loads the initial URL. It uses a while loop to continuously check for the existence of the Next button through the check_if_next function. If the button is present, the script clicks it and extracts quotes from the newly loaded page, iterating until there are no more pages left. 

Storing scraped data in a JSON

The last piece of the puzzle is storing the data you have extracted. One common format is JSON. To save the extracted data as JSON, store it in a list before writing it to a file. Here's how to implement this:

import json
from selenium import webdriver
from selenium.webdriver.common.by import By  # Import By class
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time  # For introducing delays (optional)

# Set up the webdriver (replace with your chromedriver path)
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))


# Set the initial URL
start_url = "https://quotes.toscrape.com/js/"


# Navigate to the website
driver.get(start_url)


# Initialize a list to store quote details  
quotes_list = []


def navigation(driver):
   try:
       # Get the HTML source
       html_source = driver.page_source
   except Exception as e:
       print(f"Error retrieving HTML source: {e}")
       return None, driver


   if html_source:
       # Parse the HTML with BeautifulSoup
       soup = BeautifulSoup(html_source, 'html.parser')
       return soup, driver
   else:
       return None, driver


def click_next(driver):
   next_button = driver.find_element(By.CSS_SELECTOR, "li.next > a")  # Using anchor tag within the 'li'
   next_button.click()


   # Optional: Wait for the page to reload before proceeding (adjust timeout as needed)
   time.sleep(2)
   # Get the new HTML and parse it
   soup, driver = navigation(driver)
   return soup, driver


def check_if_next(soup):
   next_button = soup.find('li', {'class': 'next'})
   return bool(next_button)


def find_quotes_on_page(soup):
    for quote in soup.find_all('div', {'class': 'quote'}):
        try:
            quote_text = quote.find('span', {'class': 'text'}).text[1:]
            quote_author = quote.find('small', {'class': 'author'}).text

            # Find all tags within the quote
            tags = quote.find_all('a', {'class': 'tag'})
            tag_list = [tag.text for tag in tags]

            print(f"Quote: {quote_text}")
            print(f"Author: {quote_author}")
            print(f"Tags: {tag_list}")
            quotes_list.append({
                "Quote": quote_text,
                "Author": quote_author,
                "Tags": tag_list
            })

        except Exception as e:
            print(f"Error processing quote: {e}")


# Initial navigation and quote extraction
soup, driver = navigation(driver)
find_quotes_on_page(soup)


# Loop through subsequent pages (if any)
while check_if_next(soup):
   soup, driver = click_next(driver)
   find_quotes_on_page(soup)


# Close the browser
driver.close()


# Write the quote details to a JSON file
with open('quotes.json', 'w') as json_file:
    json.dump(quotes_list, json_file, indent=4)

This code snippet demonstrates how to store the quote data in a JSON format, providing a structured and easy-to-read way to save the information. This code imports the necessary libraries, including json, which converts Python data structures into JSON. After setting up the Chrome WebDriver and navigating to the quotes website, the script initializes an empty list named quotes_list to hold the quote details as dictionaries.

The find_quotes_on_page function is responsible for extracting quotes from the parsed HTML. Within this function, each extracted quote, along with its author and tags, is formatted into a dictionary and appended to the quotes_list. Following the loop that navigates through subsequent pages using click_next, the script eventually closes the browser.

The final step involves writing your data into a JSON file. Using the json.dump method, the contents of quotes_list are saved into a file named quotes.json, ensuring the data is formatted correctly with indentation for readability. This process concludes the scraping and storage operations, allowing for easy access to the collected quotes in JSON format.

Running this code will output a well-formatted JSON.
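You can load the file back with the json module to confirm what was saved (a quick check, using the quotes.json file written above):

import json

with open('quotes.json') as json_file:
    quotes = json.load(json_file)

print(f"Saved {len(quotes)} quotes")
print(quotes[0])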


Web scraping best practices

Before you embark on your web scraping projects, you should familiarize yourself with key best practices:

  • Respect robots.txt: This file informs web crawlers about which parts of the site can be accessed. Always review and abide by its guidelines to avoid unnecessary conflicts.
  • Use proxies: For large-scale operations, using proxies can help maintain your privacy and safeguard against IP bans. This is particularly vital when scraping high-traffic sites like Instagram or eBay, where aggressive scraping can trigger defenses.
  • Avoid overloading websites: To maintain good etiquette, ensure that your scraping activity does not overwhelm the website's server. Throttle your requests to stay within reasonable limits.
  • Refrain from scraping sensitive data: Always respect privacy concerns and legal boundaries. Scraping personal information without consent is not only unethical, it can also lead to legal repercussions.
  • Employ headers and user agents: By including these in your requests, you can mimic normal browsing behavior, thereby reducing the risk of being flagged as a bot by the website.

Advanced web scraping techniques

Now that you understand the basics of web scraping in Python, we'll quickly go over some more advanced techniques that you can use to take your operations to the next level. We'll cover challenges such as bot detection, XPath expressions, and infinite scrolling, and walk through code samples for overcoming them.

Bot detection

While scraping is very useful for us developers, websites generally frown upon it as it uses up precious server resources without generating any value for them. Thus, many websites use bot detection and blocking mechanisms such as CAPTCHAs to prevent bots from scraping and overloading them.

To avoid bot detection, you should try to mimic human behavior as much as possible, using randomized, delayed requests and varied user agents (the strings websites use to identify the browser or client making each request). Here's how to add a random delay in Selenium:

import time
import random
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Set up the driver (ensure you have the correct path for your webdriver)
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))
driver.get("https://example.com")

# Scraping logic would go here
time.sleep(random.uniform(2, 5))  # Random delay to mimic human behavior

You can also set specific user agents in your request headers in Python as follows:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get('https://example.com', headers=headers)
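To vary the user agent between requests, you can pick one at random from a small pool on each call. A simple sketch (the strings below are only illustrative examples):

import random
import requests

# Illustrative pool of desktop user agent strings
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
]

headers = {"User-Agent": random.choice(user_agents)}
response = requests.get("https://example.com", headers=headers)
print(response.status_code)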

XPath expressions

XPath (XML Path Language) is a powerful query language used to select specific elements from an XML or HTML document. Its main advantage over CSS selectors is that it can be more precise, especially when dealing with nested or complex HTML structures.

For example, XPath can be used to extract specific attributes or elements as follows:

from lxml import html

page_content = """<html><body>
<div class="quote"><span class="text">"Quote here"</span><small class="author">Author Name</small></div>
</body></html>"""

tree = html.fromstring(page_content)
quotes = tree.xpath('//div[@class="quote"]/span[@class="text"]/text()')

for quote in quotes:
    print(quote)  # Output: "Quote here"

In this example, //div[@class="quote"]/span[@class="text"]/text() is an XPath expression that selects the text within the span of class text contained in a div of class quote. This hierarchical selection makes XPath particularly useful when the HTML structure is inconsistent or when you need to extract data based on specific attributes.
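For instance, you can pull attribute values directly by ending the expression with @attribute. A short sketch, again using a made-up HTML fragment:

from lxml import html

page_content = """<html><body>
<div class="quote">
  <span class="text">"Quote here"</span>
  <div class="tags"><a class="tag" href="/tag/life/">life</a><a class="tag" href="/tag/books/">books</a></div>
</div>
</body></html>"""

tree = html.fromstring(page_content)

# Select the href attribute of every tag link inside a quote block
tag_links = tree.xpath('//div[@class="quote"]//a[@class="tag"]/@href')
print(tag_links)  # ['/tag/life/', '/tag/books/']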

Infinite scrolling

Unlike traditionally paginated websites, infinite-scrolling websites don’t follow the typical request-response structure that allows for easy web scraping. 

To scrape such websites, you can use Selenium and keep scrolling until there’s no change in the height of the webpage, as that would signal the end of new content. Here’s how:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time

driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))
driver.get("https://example.com/infinite-scroll")

last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Wait for new content to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # Exit the loop if no more content is loaded
    last_height = new_height

In this example, the script keeps scrolling until the page height remains unchanged, indicating that all content has been loaded. Applying delays allows appropriate time for new elements to appear.

Web scraping with SOAX

As you dive deeper into web scraping, add SOAX to your toolkit.

SOAX offers more than 155 million residential proxies, allowing you to scrape websites with high levels of anonymity and a lower risk of IP bans. With automatic IP rotation, you can send multiple requests without triggering anti-bot measures.

Our advanced targeting options let you pick locations down to the city level, helping you access geo-restricted content easily. Plus, with sticky sessions, you can maintain a consistent IP address for scraping tasks that need stability.

SOAX also provides user-friendly web scraping APIs that make data collection easier than ever. These APIs come with built-in features to manage challenges like CAPTCHAs and include error handling, so you can focus on gathering your data without the typical hassles.

To explore all these features, sign up for SOAX’s three-day trial for just $1.99, which gives you 100MB of data to test it all out.

Frequently asked questions

What is web scraping?

Web scraping is the automated method of collecting particular data from websites. In contrast to web crawling, which methodically navigates links throughout a site, web scraping concentrates on retrieving specific information from individual pages or sets of pages.


What is web scraping used for?

Web scraping serves various purposes, including data analysis, market research, price comparison, and content aggregation. For instance, scrapers can pull job listings from career websites to analyze trends or collect movie ratings and reviews to build recommendation engines.

Is web scraping legal?

In general, collecting publicly available information is lawful, although regulations may differ depending on the jurisdiction and a website's terms of service. Importantly, gathering personal data without permission is unlawful. Always give priority to privacy and data protection regulations, and when uncertain, obtain permission.

Why is Python good for web scraping?

Python is preferred for web scraping because of its simple syntax and the presence of libraries such as requests for handling HTTP requests and BeautifulSoup for HTML parsing. These characteristics make Python a flexible and user-friendly choice for performing web scraping tasks effectively.

Lisa Whelan

Lisa is a London-based tech expert and AI enthusiast. With a decade of experience, she specializes in writing about data, web scraping, and cybersecurity. She's here to share valuable insights and break down complex technical concepts for the SOAX audience.
