How to Scrape a Real Website Using Python (+ Code Snippets)

Written by: Robin Geuens

 

In this tutorial, we're going to dive into the nuts and bolts of web scraping using Python.

We'll go through how to scrape a fictional book store and get the title, price, URL, and image URLs for all books with a 5-star rating. Then, we'll compile this information in a CSV file.

Web scraping can be distilled into four core steps:

  1. Inspecting the website: Understanding the structure of the website helps you determine where the data you want is located.
  2. Retrieving the HTML from the website: This step involves making HTTP requests to the website to pull the HTML content.
  3. Processing the HTML and extracting the data: Here, you parse the HTML and extract the required data points.
  4. Storing your data: In our case, we will compile our data into a CSV file.

 

If you want to skip around, you can use the table of contents below:

How to scrape a website using Python

  1. Prerequisites
  2. Inspecting the Website You Want to Scrape
    1. Understanding the Website Structure Using Dev Tools
    2. Fetching the HTML From the Site
    3. Dealing with Dynamic Websites
    4. Handling Scraping Errors
  3. Working With HTML and Extracting Our Data
    1. Dealing with Pagination in Web Scraping
  4. Storing Scraped Data in a CSV File
    1. Storing Scraped Data in a JSON File
  5. Web Scraping Best Practices
  6. Frequently Asked Questions

Prerequisites

Before we get into the coding, let's make sure we are ready for it. This guide assumes you have a basic understanding of Python. If you don't have that, there are plenty of good tutorials on YouTube (like this one, for example).

For this tutorial, we'll use two Python libraries: requests and BeautifulSoup.

  • Requests: This library allows us to make HTTP requests to the website and pull the HTML content. If it's not installed, no problem! You can install it by typing pip install requests into your terminal.
  • BeautifulSoup: This is our tool of choice for processing the HTML and extracting our required data. Think of it as a digital pair of tweezers, enabling us to pluck out the data from the clutter of HTML. Installing BeautifulSoup is a breeze as well. Just type pip install beautifulsoup4 into your terminal.

Now that we've set the stage, are you ready to plunge into the world of web scraping? In the next section, we'll begin by inspecting our website and familiarizing ourselves with its structure. Let's get started!

Inspecting the Website You Want to Scrape

The first step in scraping any website is understanding the website we're planning to scrape. This tutorial will walk you through the process using the site https://books.toscrape.com/ as an example.

(Screenshot: the books.toscrape.com homepage)

As you can see, the website has a bunch of different categories, book listings, ratings, and prices. Let's see how we could extract all of that data.

Understanding the Website Structure Using Dev Tools

Your browser's developer tools let you inspect the HTML behind any page. To access them, press Ctrl + Shift + I on Windows or Command + Option + I on Mac.

Let's put this into practice. After opening the developer tools, try selecting a book element on the page. You'll see that each book on the site is contained in an article tag with the class of product_pod.

This little nugget of information is crucial, as it allows us to compile a list of books with all their details.

(Screenshot: inspecting a book element with the browser's dev tools)

Now that we know what class to target, let's see how we can get the actual HTML from the site.

Fetching the HTML From the Site

Now that we have a good grasp of the website structure, it's time to fetch the HTML from the site. We'll use the Python library 'requests' to send a GET request to our target URL (https://books.toscrape.com). Here's a Python script to do that:


import requests

# Base URL   
base_url = 'https://books.toscrape.com'

# Send GET request to the base URL   
response = requests.get(base_url)

# Get the HTML content   
html_content = response.text

# Print the HTML content   
print(html_content)  

If we printed the response object itself, we'd only see a status code like <Response [200]>, but we're more interested in the actual HTML content. By calling response.text, we get the HTML of the homepage, which will serve as our starting point for extracting the books.
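Here's a quick sketch of the difference (the slice on the last line is just there to keep the output short):

import requests

response = requests.get('https://books.toscrape.com')

# Printing the response object only shows the status code
print(response)              # <Response [200]>
print(response.status_code)  # 200

# The actual HTML lives in response.text
print(response.text[:200])   # first 200 characters of the homepage HTML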

It's vital to note that this process can vary based on the website type. The site we're using as an example is static and doesn't need login credentials. 

But, for dynamic websites or those requiring login, headless browsers come in handy.

Dealing with Dynamic Websites

Let's say you're trying to scrape a modern, dynamic website like Airbnb, which heavily relies on JavaScript to render its content.

As an experiment, try disabling JavaScript for Airbnb. You'll notice the website becomes unusable.

(Screenshot: Airbnb with JavaScript disabled)

This presents us with a common challenge: how do we scrape data from these dynamic websites? The answer lies in using headless browsers like Selenium, Puppeteer, or Playwright.

While this topic is vast enough to warrant its own dedicated post, let's provide a simplified yet illustrative example here. We're going to extract all the quotes from https://quotes.toscrape.com/js/.

First things first, you'll need to install Selenium, which can be done using pip, the Python package manager. Simply type pip install selenium in your terminal or command prompt and hit Enter.

Once you have Selenium installed, you can import it and use it in your script. Let's dive into an example:


from selenium import webdriver

# Set up the webdriver.
# In this example, we're using the Chrome driver. With Selenium 4.6+,
# calling webdriver.Chrome() without a driver path lets Selenium Manager
# download a matching chromedriver for you; on older versions, pass the
# path to your chromedriver via a Service object instead.
options = webdriver.ChromeOptions()
options.add_argument('--headless=new')  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)

# Navigate to the website
driver.get("https://quotes.toscrape.com/js/")

# Get the HTML source (after the JavaScript has rendered the page)
html_source = driver.page_source
print("HTML source of the website:", html_source)

# Close the browser and free up resources
driver.quit()

Now we have our HTML, and we can process it using BeautifulSoup.
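To make that next step concrete, here's a minimal sketch of what processing the Selenium output could look like, assuming the rendered page uses the same div.quote / span.text markup as the static quotes.toscrape.com site (if the quotes haven't rendered yet when you grab page_source, you may need to add an explicit wait first):

from bs4 import BeautifulSoup

# Parse the HTML we grabbed with Selenium
soup = BeautifulSoup(html_source, 'html.parser')

# Each quote is assumed to sit in a div with the class 'quote'
for quote in soup.find_all('div', {'class': 'quote'}):
    text = quote.find('span', {'class': 'text'}).text
    author = quote.find('small', {'class': 'author'}).text
    print(f"{text} - {author}")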

Handling Scraping Errors

Even the best-laid plans can go awry, and the same is true for web scraping. Websites aren't always perfect. Sometimes, you might come across missing or incomplete data.

So, what happens when your carefully crafted scraping script encounters these bumps in the road? It might very well throw errors. But don't worry, that's where Python's try-except blocks ride to the rescue.

Try-except blocks are our safety nets when our code stumbles upon an error it doesn't know how to handle. Let's see them in action:


import requests  
   
# Base URL  
base_url = 'https://books.toscrape.com'  
   
try:  
    # Send GET request to the base URL  
    response = requests.get(base_url)  
    response.raise_for_status()  # treat 4xx/5xx status codes as errors too  
   
    # Get the HTML content  
    html_content = response.text  
   
    # Print the HTML content  
    print(html_content)  
   
except requests.RequestException as e:  
    print(f"An error occurred: {e}")  

In the above example, if the GET request to the base URL fails for any reason, the code in the except block steps up. It lets us dictate how the program responds to an error, ensuring it doesn't crash and burn.
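If you want to go one step further, you can wrap the request in a small retry helper so that transient network hiccups don't end the whole run. This is just an illustrative sketch; fetch_html is a hypothetical helper, not something built into requests:

import time
import requests

base_url = 'https://books.toscrape.com'

def fetch_html(url, retries=3, delay=2):
    """Hypothetical helper: try a URL a few times before giving up."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # treat 4xx/5xx responses as errors too
            return response.text
        except requests.RequestException as e:
            print(f"Attempt {attempt} failed: {e}")
            time.sleep(delay)  # pause before retrying
    return None

html_content = fetch_html(base_url)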

In the next section, we'll dive into parsing this HTML content and extracting the data we need. So, keep an eye out!

Working With HTML and Extracting Our Data

You've got your HTML page loaded and ready to go. Now what? It's time to dive in and extract the gems of information you need. We'll do this using the Python library, Beautiful Soup. This library is a master key for pulling data out of HTML and XML files.

To start, we need to convert our HTML into a Beautiful Soup object. Here's how you do it:


from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')

Now, our goal is to find all the books on the page. To achieve this, we'll use the find_all() method. This method returns a list of all instances of a particular tag and its associated attributes. But how do we identify which tags to look for? Here are a few ways to locate elements (there's a short sketch of each right after this list):

  • By ID
  • By Class
  • By XPath (Beautiful Soup itself doesn't support XPath, so you'd need to use the lxml library directly for this)
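Here's a minimal sketch of each approach. The id in the first lookup is hypothetical and just shows the syntax; the class and the XPath expression match the markup we saw on books.toscrape.com:

import requests
from bs4 import BeautifulSoup
from lxml import html

html_content = requests.get('https://books.toscrape.com').text
soup = BeautifulSoup(html_content, 'html.parser')

# By ID (the id here is hypothetical, just to show the syntax)
sidebar = soup.find(id='sidebar')

# By class
prices = soup.find_all('p', {'class': 'price_color'})

# By XPath, going through lxml directly rather than Beautiful Soup
tree = html.fromstring(html_content)
titles = tree.xpath('//article[@class="product_pod"]//h3/a/@title')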

From our initial research, we know that we need to target the article tag with the class of product_pod. Let's do just that:


books = soup.find_all('article', {'class': 'product_pod'})  

Next, it's time to extract specific details for each book: the title, price, URL, and image URL. We can do this by using find() to search within our 'book' element. Remember, we already know the specific tags and classes we need to target from our research.

  • To get the title, we need to find the 'h3' tag within the 'book' element, locate the link, and then select the link's title.
  • For the price, we look for a 'p' tag with the class 'price_color'.
  • To get the book URL, we find the same 'h3' as the title, but extract the URL instead of the title using ['href'].
  • For the image URL, we search for the 'img' element and select the source URL.

Here's how it looks in code:


from urllib.parse import urljoin  # turns the site's relative links into full URLs

for book in soup.find_all('article', {'class': 'product_pod'}):
    title = book.find('h3').find('a')['title']
    price = book.find('p', {'class': 'price_color'}).text[1:]  # strip the currency symbol
    book_url = urljoin(base_url, book.find('h3').find('a')['href'])
    image_url = urljoin(base_url, book.find('img')['src'])

Notice how we're using the slice [1:] to strip the currency symbol from the price, and urljoin() (from Python's urllib.parse module) to turn the site's relative href and src values into full URLs. You can also start filtering the data you extract.

Let's say you only want 5-star books. You can add an if statement telling Python to check if the book has a five-star rating:


for book in soup.find_all('article', {'class': 'product_pod'}):  
    if book.find('p', {'class': 'star-rating Five'}):  
        title = book.find('h3').find('a')['title']  
        price = book.find('p', {'class': 'price_color'}).text[1:]  
        book_url = urljoin(base_url, book.find('h3').find('a')['href'])  
        image_url = urljoin(base_url, book.find('img')['src'])  

This will only extract the book details if the book has a 5-star rating. So, here's how our script looks so far:


import requests  
from bs4 import BeautifulSoup  
from urllib.parse import urljoin  
 
# Base URL  
base_url = 'https://books.toscrape.com'  
 
response = requests.get(base_url)  
 
soup = BeautifulSoup(response.text, 'html.parser')  
 
# Find all books with 5-star rating  
for book in soup.find_all('article', {'class': 'product_pod'}):  
    if book.find('p', {'class': 'star-rating Five'}):  
        try:  
            title = book.find('h3').find('a')['title']  
            price = book.find('p', {'class': 'price_color'}).text[1:]  
            book_url = urljoin(base_url, book.find('h3').find('a')['href'])  
            image_url = urljoin(base_url, book.find('img')['src'])  
             
            print(title)  
            print(price)  
            print(book_url)  
            print(image_url)  
             
        except Exception as e:  
            print(f"Error processing data for book '{title}': {e}")  

And voila! You've written a script that navigates a webpage, hunts down 5-star books, extracts their title, price, URL, and image URL, and prints them out. 

Dealing with Pagination in Web Scraping

In the previous section, we've scraped the first page of our target website. But what if we wanted to scrape the entire site? 

To scrape data from all the pages, we need to tweak our script to traverse the entire website. In our case, the website has a collection of 50 pages waiting to be scraped.

The website's URL structure gives us a clue about how to do it: https://books.toscrape.com/catalogue/page-2.html. Notice the page number at the end of the URL. A simple URL structure like that allows us to use a Python for-loop.

This loop will go through each page, swapping the page number in the URL with the iteration index.

In Python, the loop would look something like this:


# Looping through all the pages  
for i in range(1, 51):  # The website has 50 pages  
    url = f"https://books.toscrape.com/catalogue/page-{i}.html"  
    response = requests.get(url)  
    soup = BeautifulSoup(response.text, 'html.parser')  

In this script, Python replaces {i} with the current iteration number, so the script loads the next page on each pass through the loop.

To merge this into our existing script, we would change our code as follows:


import requests  
from bs4 import BeautifulSoup  
from urllib.parse import urljoin  
   
# Establishing our Base URL  
base_url = 'https://books.toscrape.com'  
   
# Looping through all the pages  
for i in range(1, 51):  # The website has 50 pages  
    url = f"{base_url}/catalogue/page-{i}.html"  
    response = requests.get(url)  
    soup = BeautifulSoup(response.text, 'html.parser')  
   
    # Hunting for all books with a 5-star rating  
    for book in soup.find_all('article', {'class': 'product_pod'}):  
        if book.find('p', {'class': 'star-rating Five'}):  
            try:  
                title = book.find('h3').find('a')['title']  
                price = book.find('p', {'class': 'price_color'}).text[1:]  
                book_url = urljoin(url, book.find('h3').find('a')['href'])  
                image_url = urljoin(url, book.find('img')['src'])  
                 
                print(title)  
                print(price)  
                print(book_url)  
                print(image_url)  
             
            except Exception as e:  
                print(f"Error processing data for book '{title}': {e}")
 

Now the script will scrape all 50 pages of the site.

There might be times, though, when the URL structure doesn't play nice. In such cases, we can navigate the pages by tracking down the URL of the 'next' button and using that as our guide.

We can find the "next" button the same way we found all the other elements – by using BeautifulSoup's find() function.

This method is a bit more challenging but can come in handy when dealing with intricate URL structures. Here's an example of a function that would find the next page link:


def get_next_page_url(soup):
    """Return the URL of the next page if it exists,
     otherwise return None."""
    next_button = soup.find('li', class_='next')
    if next_button:
        return next_button.a['href']
    return None  

This function looks through the page's HTML for a 'next' button. If it finds one, it returns the link's href, which is a relative URL, so you'll want to join it with the current page's URL. If there's no next button, it returns None, which is your signal to stop.
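Here's a minimal sketch of how you might put get_next_page_url() to work, following 'next' links until there are none left:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://books.toscrape.com/'
while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # ... extract the book details from soup here ...

    next_href = get_next_page_url(soup)
    # Turn the relative href (e.g. 'catalogue/page-2.html') into a full URL
    url = urljoin(response.url, next_href) if next_href else None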

Storing Scraped Data in a CSV File

Until now, we have simply printed out the results of our scraping efforts in the console. But let's kick things up a notch. 

What if we want to save this data for future use or analysis? That's where writing to files comes in handy.

First things first, remember that Python will save the CSV file in the current working directory unless you specify another location. Let's begin by importing the CSV library with a simple command: import csv.

After that, we need to create a file. For this example, let's call it 5_star_books.csv. We'll open it in write mode with the command with open('5_star_books.csv', 'w', newline='') as csvfile:.

Then, it's time to create our column headers. Since we're recording the title, price, URL, and image URL of books, here's how we do it:


with open('5_star_books.csv', 'w', newline='') as csvfile:  
    fieldnames = ['Title', 'Price', 'URL', 'Image URL']  
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)  
    writer.writeheader()    

Now that our CSV file is open, we're ready to start scraping and looping through all the pages. As we find each book with a 5-star rating, we write the information to the CSV file using writerow. It looks something like this:


writer.writerow({  
    'Title': title,  
    'Price': price,  
    'URL': book_url,  
    'Image URL': image_url  
})

Our full script at this point should look like the following:


import requests  
from bs4 import BeautifulSoup  
import csv  
from urllib.parse import urljoin  
 
# Base URL  
base_url = 'https://books.toscrape.com'  
 
# Initialize CSV file  
with open('5_star_books.csv', 'w', newline='') as csvfile:  
    fieldnames = ['Title', 'Price', 'URL', 'Image URL']  
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)  
    writer.writeheader()  
 
    # Loop through all the pages  
    for i in range(1, 51):  # The website has 50 pages  
        url = f'{base_url}/catalogue/page-{i}.html'  
         
        response = requests.get(url)  
        soup = BeautifulSoup(response.text, 'html.parser')  
         
        # Find all books with 5-star rating  
        for book in soup.find_all('article', {'class': 'product_pod'}):  
            if book.find('p', {'class': 'star-rating Five'}):  
                 
                try:  
                    title = book.find('h3').find('a')['title']  
                    price = book.find('p', {'class': 'price_color'}).text[1:]  
                    book_url = urljoin(url, book.find('h3').find('a')['href'])  
                    image_url = urljoin(url, book.find('img')['src'])  
                     
                    # Write to CSV  
                    writer.writerow({  
                        'Title': title,  
                        'Price': price,  
                        'URL': book_url,  
                        'Image URL': image_url  
                    })  
                 
                except Exception as e:  
                    print(f"Error processing data for book '{title}': {e}")  

If we run this script, we get a nice CSV of all the 5-star books, along with each book's price, URL, and image URL. Not bad!

 

(Screenshot: the resulting 5_star_books.csv file)

Storing Scraped Data in a JSON File

But what if your requirements demand a JSON file instead of a CSV file? Perhaps you need to merge this data with other applications or the data structure is more complex, making JSON a better fit.

The process for writing to a JSON file is like that of a CSV file. The main difference is that we'll add the book details to a list and then write this list to the JSON file.

Here's how it looks in the code:


import requests  
from bs4 import BeautifulSoup  
import json  
from urllib.parse import urljoin  
   
# Base URL  
base_url = 'https://books.toscrape.com'  
   
# Initialize a list to store book details  
books_list = []  
   
# Loop through all the pages  
for i in range(1, 51):  # The website has 50 pages  
    url = f'{base_url}/catalogue/page-{i}.html'  
     
    response = requests.get(url)  
    soup = BeautifulSoup(response.text, 'html.parser')  
     
    # Find all books with 5-star rating  
    for book in soup.find_all('article', {'class': 'product_pod'}):  
        if book.find('p', {'class': 'star-rating Five'}):  
             
            try:  
                title = book.find('h3').find('a')['title']  
                price = book.find('p', {'class': 'price_color'}).text[1:]  
                book_url = urljoin(url, book.find('h3').find('a')['href'])  
                image_url = urljoin(url, book.find('img')['src'])  
                 
                # Append the book details to the list  
                books_list.append({  
                    'Title': title,  
                    'Price': price,  
                    'URL': book_url,  
                    'Image URL': image_url  
                })  
             
            except Exception as e:  
                print(f"Error processing data for book '{title}': {e}")  
   
# Write the book details to a JSON file  
with open('5_star_books.json', 'w') as json_file:  
    json.dump(books_list, json_file, indent=4)    

Whether you're writing to a CSV or JSON file, Python makes it a breeze to save and organize the data you've scraped from the web. Now you're better equipped to handle your web scraping projects. Happy scraping!
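And if you later want to load that saved data back in for analysis, here's a quick sketch (using the file names from the scripts above):

import csv
import json

# Read the CSV back in as a list of dictionaries
with open('5_star_books.csv', newline='') as csvfile:
    books_from_csv = list(csv.DictReader(csvfile))

# Read the JSON back in
with open('5_star_books.json') as json_file:
    books_from_json = json.load(json_file)

print(len(books_from_csv), len(books_from_json))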

Let's take a moment to recap our journey in the world of web scraping so far:

  • We began by inspecting a website to understand its structure.
  • We utilized the requests library in Python to fetch the HTML content of each web page on the site.
  • We used BeautifulSoup, a Python library, to extract necessary information from the HTML content.
  • Finally, we wrote the extracted information to a CSV file and a JSON file.

If you've gotten this far, pat yourself on the back! You just scraped a real website and you're ready for bigger challenges. Let's go over some best practices.

Web Scraping Best Practices

Let's get acquainted with some key web scraping best practices:

  • Respect the robots.txt file: Websites use this file to lay out how search engines and other web robots should interact with them. Always check this file and adhere to its instructions.
  • Use proxies: If you're doing large-scale web scraping, using proxies is a smart move. They bolster your privacy and help keep your IP address from getting blocked, especially if you're trying to scrape sites like Instagram. Luckily, it's quite easy to use Python with a proxy.
  • Avoid overloading the website: Keep your requests within a reasonable limit to prevent disrupting the website's services.
  • Refrain from scraping sensitive or private data: Scraping personal data without consent is both illegal and unethical.
  • Use headers and user agents: By including these in your requests, you can mimic human browsing behavior, which reduces the chance of being flagged as a bot. The sketch after this list shows this, along with a polite delay between requests.
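As a rough illustration of those last few points, here's a sketch that sets a User-Agent header and pauses between requests (the header value is just an example string, not a requirement):

import time
import requests

# Example headers; swap in something that identifies your scraper appropriately
headers = {'User-Agent': 'Mozilla/5.0 (compatible; my-book-scraper/1.0)'}

for i in range(1, 6):
    url = f"https://books.toscrape.com/catalogue/page-{i}.html"
    response = requests.get(url, headers=headers, timeout=10)

    # ... process response.text here ...

    time.sleep(1)  # be polite: wait a second between requests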

Frequently Asked Questions

What is Web Scraping?

Web scraping is a technique used to extract data from websites in an automated manner. It's different from web crawling, which involves following links and exploring various pages on a website, much like a spider. Web scraping is all about gathering specific information from a single page or a group of pages.

What is Web Scraping Used for?

Web scraping comes in handy for a variety of tasks, such as data analysis, market research, price comparison, and content aggregation. For instance, you could scrape job listings from a career website to examine trends in the job market. Or pull movie ratings and reviews from a media website to build a recommendation engine.

Is Web Scraping Legal?

Scraping publicly accessible information is generally legal, but the specifics can vary depending on your jurisdiction and the website's terms of service. However, scraping personal data without consent is illegal. It's crucial to respect privacy and data protection laws, and if you're unsure, it's better to ask for permission first.

Why is Python Good for Web Scraping?

Python is a popular choice for web scraping because of its simplicity and the wide range of libraries available, like 'requests' for executing HTTP requests and 'BeautifulSoup' for parsing HTML. This makes Python a versatile and straightforward tool for web scraping tasks.

Robin Geuens

Robin is the SEO specialist at SOAX. He likes learning new skills and automating things with Python and GPT. Outside of work he likes reading, playing videogames, and traveling.
