The internet is a vast source of information for anyone who knows how to extract data from it. As a result, the demand for web scraping has grown exponentially in recent years, with Python becoming the most popular programming language for the job.
In this tutorial, you will learn what web scraping is, build a real Python scraper step by step, and export the scraped data to CSV and JSON.
Let's dive into the world of web scraping using Python!
Web scraping is the process of retrieving data from the web. Even copying and pasting content from a page is a form of scraping! However, the term typically refers to tasks automated by software, essentially scripts (also known as bots, spiders, or crawlers) that access websites and extract desired data from their pages. In our example, we'll use Python.
It's worth noting that many websites implement anti-scraping techniques for various reasons. But don't worry, as we'll show you how to bypass them later!
Are you ready to get started? You'll build a real spider to retrieve data from ScrapeMe, a Pokémon e-commerce website built for learning web scraping.
What you'll see is just an example, aimed at understanding how web scraping works in Python. However, remember that you can apply what you learn here to any other website. The process might be more complex, but the key concepts to follow remain the same.
Before diving into writing some code, you need to meet some prerequisites:
To build a data scraper using Python, you'll need to download and install the following tools:
Python 3.11+: This tutorial refers to Python 3.11.2, the latest version at the time of writing.
pip: Python's package installer, which lets you install libraries from the Python Package Index (PyPI) with a single command.
IDE: Any IDE that supports Python can be used.
Note: If you're a Windows user, don't forget to select the Add python.exe to PATH option during the installation wizard. This enables Windows to use the python and pip commands in the terminal. Note that pip is included by default with Python 3.4 or later versions, so you don't need to install it manually.
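To confirm that both tools are available on your PATH, you can run the following commands (on Windows the interpreter is usually invoked as python rather than python3):

```shell
# print the interpreter version to confirm the installation
python3 --version

# pip ships with Python 3.4+, so this should work out of the box
python3 -m pip --version
```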
Now you have everything you need to build your first web scraper with Python. Let's move on!
Create a folder and inside it create a file named main.py. Open this file in your IDE.
Write the following code in the file:
def print_hello(name):
    print(f"Hello, {name}!")

if __name__ == "__main__":
    print_hello('world')
Execute the following command in the terminal to verify:
python3 main.py
If you see the following output in the terminal, it means you have successfully run Python:
Hello, world!
You might be eager to start coding right away, but that's not the best approach. First, you need to spend some time understanding your target website. This might sound tedious or unnecessary, but it's the only way to study the site's structure and figure out how to scrape data from it. Every scraping project begins this way.
Browse the website, try out the search functionality, click some buttons, and observe the website's responses and page structure.
Web servers return an HTML document based on the requested URL, with each document associated with a specific page. Consider the URL for the fourth page of the product list:
https://scrapeme.live/shop/page/4/
You can break any of them down into two main parts:

Base URL: the path to the shop section of the website (here, https://scrapeme.live/shop/).
Specific page location: the path to the individual page (here, page/4/). It may end with an extension such as .html or .php, or have no extension at all.

All the products offered on the website share the same base URL. What varies between pages is the latter half, which contains a string specifying which page the server should return. Typically, URLs for pages of the same type share a similar format.
Furthermore, URLs can also contain additional information:

Path parameters: values embedded in the path itself (e.g., in https://scrapeme.live/shop/page/4/, the path parameter is 4).
Query parameters: values appended after a question mark (?) at the end of the URL. They often encode filter values to be sent to the server when performing a search (e.g., in https://www.example.com/search?search=blabla&sort=newest, search=blabla and sort=newest are query parameters).

It's important to note that any query string consists of:

?: marks the beginning of the query string.
key=value pairs separated by &: where key is the name of a parameter and value represents its value.

In other words, URLs are not just simple location strings for HTML documents. They can carry parameter information that servers use to run queries and populate pages with specific data.
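This anatomy can be verified with Python's standard urllib.parse module, which splits a URL into exactly these components (the URL below is the example from the text):

```python
from urllib.parse import urlparse, parse_qs

# split an example search URL into its components
url = "https://www.example.com/search?search=blabla&sort=newest"
parts = urlparse(url)

print(parts.netloc)           # www.example.com
print(parts.path)             # /search
print(parse_qs(parts.query))  # {'search': ['blabla'], 'sort': ['newest']}
```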
You're now familiar with the website. The next step is to delve into the HTML code of the pages, studying their structure and content to understand how to extract data from them.
All modern browsers come with a set of advanced developer tools, most of which offer nearly identical functionalities. These tools allow you to explore the HTML code of web pages and work with it. In this Python web scraping tutorial, you'll see the practical application of Chrome's DevTools.
Right-click on an HTML element and choose Inspect
to open the DevTools window. If the website has disabled the right-click menu, you can press F12 or use the Ctrl+Shift+I shortcut (Cmd+Option+I on macOS) instead.
DevTools allow you to inspect the structure of the webpage's Document Object Model (DOM), which in turn helps you understand the source code more deeply. In the DevTools window, navigate to the Elements tab to access the DOM.
However, web scraping faces numerous challenges in today's advanced online landscape. Websites often deploy anti-bot measures such as IP blocking, rate limiting, CAPTCHAs, and browser fingerprinting to keep automated visitors out.
Nstbrowser has been regarded as one of the solutions to these challenges. With Nstbrowser, you can overcome these obstacles effortlessly. It employs browser emulation techniques to mimic human behavior, reducing the risk of triggering automated protection mechanisms. User-Agent rotation ensures that your scraping activities remain undetected.
Now all the preparations are finished! The detailed steps follow:
Get ready to fire up your Python, as you are all set to write some code.
Assuming you want to scrape some data from the following location:
https://scrapeme.live/shop/
To retrieve the HTML code of the target page, you first need to download the HTML document associated with the page URL. To achieve this, you can use the Python requests library.
To install the requests library, use the following command:
pip install requests
Create a new scraper.py
file and add the following code to the file:
import requests
# download the HTML document with an HTTP GET request
response = requests.get("https://scrapeme.live/shop/")
# print the HTML content of the page
print(response.text)
This code imports the requests
library and then uses it to make a GET request to the URL of the target page. It returns the response content, which includes the HTML document.
By printing the text
attribute of the response, you can see the structure of the target page's code:
<!doctype html>
<html lang="en-GB">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=2.0">
<link rel="profile" href="http://gmpg.org/xfn/11">
<link rel="pingback" href="https://scrapeme.live/xmlrpc.php">
<title>Products – ScrapeMe</title>
<!-- rest of the page omitted for brevity... -->
Common mistake: Forgetting to handle exceptions.
A GET request can fail for various reasons, such as the server being temporarily unavailable, an incorrect URL, or your IP being blocked. Therefore, it is important to handle errors in the following way:
import requests
# download the HTML document with an HTTP GET request
response = requests.get("https://scrapeme.live/shop/")
if response.ok:
    # scraping logic here
    pass
else:
    # log the error response
    # in case of 4xx or 5xx
    print(response)
Thanks to this error-handling logic, your script will not crash if something goes wrong during the request. It will proceed with the scraping logic only if the response status code is in the 2xx range.
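If you prefer exceptions over manual status checks, an equivalent pattern uses raise_for_status(), which throws an HTTPError for any 4xx or 5xx response. This is a sketch of the same logic, not a replacement for the tutorial's code:

```python
import requests

try:
    response = requests.get("https://scrapeme.live/shop/", timeout=10)
    # raises requests.HTTPError on any 4xx or 5xx status code
    response.raise_for_status()
    html = response.text
except requests.RequestException as error:
    # covers connection errors, timeouts, and bad status codes alike
    print(f"Request failed: {error}")
```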
In the previous step, you retrieved the HTML document from the server. If you look at it, you'll see one long string of markup; the only way to make sense of it is to extract the required data through HTML parsing.
Beautiful Soup is a Python library for parsing XML and HTML content, and it exposes an API for exploring HTML code. In other words, it allows you to select HTML elements and easily extract data from them.
To install the library, type and execute the following command in the terminal:
pip install beautifulsoup4
First request the page with requests as before, then feed the retrieved content to Beautiful Soup for parsing:
import requests
from bs4 import BeautifulSoup
# download the HTML document with an HTTP GET request
response = requests.get("https://scrapeme.live/shop/")
if response.ok:
    # scraping logic here
    soup = BeautifulSoup(response.text, 'html.parser')
else:
    # log the error response
    # in case of 4xx or 5xx
    print(response)
It's important to note that websites contain data in various formats. Single elements, lists, and tables are just a few examples. To make your Python scraper effective, you need to know how to use Beautiful Soup in many scenarios. Let's take a look at how to tackle the most common challenges!
Beautiful Soup provides several methods to select HTML elements from the DOM, and the id
attribute is the most efficient way to select a single element. As the name suggests, the id
uniquely identifies an HTML node on the page.
To find the search box, right-click it and open the DevTools to view its id:
As you can see, the <input>
element has the following id:
woocommerce-product-search-field-0
You can use this information to select the product search element:
product_search_element = soup.find(id="woocommerce-product-search-field-0")
The find()
function allows you to extract a single HTML element from the DOM.
Please note that the id
is an optional attribute. That's why there are other methods to select elements:
By tag: Use find()
without any parameters:
h1_element = soup.find("h1")
By class: Add the class_
parameter:
search_input_element = soup.find(class_="search_field")
By attribute: Use the attrs
parameter:
search_input_element = soup.find(attrs={"name": "s"})
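The three approaches behave as follows on a tiny standalone snippet (the HTML below is invented for illustration; it is not from the ScrapeMe page):

```python
from bs4 import BeautifulSoup

# a minimal, made-up document to demonstrate each selector
html = """
<h1>Shop</h1>
<input id="search-box" class="search_field" name="s">
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find("h1").get_text())              # Shop
print(soup.find(class_="search_field")["id"])  # search-box
print(soup.find(attrs={"name": "s"})["id"])    # search-box
```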
Now that you have learned how to navigate the page and extract information from elements, let's dive into web scraping for real!
Web pages often contain lists of elements, such as product listings in an e-commerce store. Retrieving data from them can be time-consuming, but this is where Python's Beautiful Soup web scraping comes into play!
The Pokémon product list is contained within <li>
elements:
Let's get them:
product_elements = soup.select("li.product")
Iterate through them and extract the product data as follows:
for product_element in product_elements:
    product_name = product_element.find("h2").get_text()
    product_url = product_element.find("a")["href"]
    product_image = product_element.find("img")["src"]
    product_price = product_element.select_one(".amount").get_text()
When handling a list of elements, it is recommended to store the scraped data in a list of dictionaries. In Python, a dictionary is a collection of key-value pairs (insertion-ordered since Python 3.7), and you can use it as follows:
# the list of dictionaries containing the
# scraped data
pokemon_products = []

for product_element in product_elements:
    product_name = product_element.find("h2").get_text()
    product_url = product_element.find("a")["href"]
    product_image = product_element.find("img")["src"]
    product_price = product_element.select_one(".amount").get_text()

    # define a dictionary with the scraped data
    new_pokemon_product = {
        "name": product_name,
        "url": product_url,
        "image": product_image,
        "price": product_price
    }

    # add the new product dictionary to the list
    pokemon_products.append(new_pokemon_product)

# print the list of dictionaries
print(pokemon_products)
You now have a list called pokemon_products
that contains all the information scraped from each individual product on the page.
Fantastic! You now have all the building blocks you need to build a data scraper using Beautiful Soup in Python. But let's keep moving forward; the tutorial is not over yet!
Retrieving web page content is often the first step in a larger process. The next step is to use the scraped information for different needs and purposes. Therefore, it is crucial to convert it into a format that is easy to read and explore, such as CSV or JSON.
The product information now sits in the pokemon_products
list built earlier. Let's learn how to convert it into a new format and export it to a file in Python!
CSV is a popular format for data exchange, storage, and analysis, especially when dealing with large datasets. CSV files store information in a tabular form, with values separated by commas. This makes it compatible with spreadsheet programs like Microsoft Excel.
Here's how you can convert a list of dictionaries into a CSV file in Python:
import csv
# scraping logic...
# write the scraped data to a CSV file
csv_file = open("pokemon_products.csv", "w", encoding="utf-8", newline="")
# create a CSV writer object
writer = csv.writer(csv_file)
# convert each element of the list to a row in the CSV file
for pokemon_product in pokemon_products:
    writer.writerow(pokemon_product.values())
# release the resources
csv_file.close()
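Since the scraped items are already dictionaries, csv.DictWriter is a handy alternative: it writes a header row for you and maps dictionary keys to columns. The sketch below uses a couple of made-up sample products in the same shape as the tutorial's data:

```python
import csv

# made-up sample records matching the tutorial's dictionary shape
pokemon_products = [
    {"name": "Bulbasaur", "url": "https://scrapeme.live/shop/Bulbasaur/",
     "image": "https://scrapeme.live/wp-content/img1.png", "price": "£63.00"},
    {"name": "Ivysaur", "url": "https://scrapeme.live/shop/Ivysaur/",
     "image": "https://scrapeme.live/wp-content/img2.png", "price": "£87.00"},
]

# the with statement closes the file automatically
with open("pokemon_products.csv", "w", encoding="utf-8", newline="") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["name", "url", "image", "price"])
    writer.writeheader()  # column names as the first row
    writer.writerows(pokemon_products)
```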
JSON is a lightweight, versatile, and popular data interchange format, especially in web applications. It is commonly used to transfer information between servers or between clients and servers through APIs. Many programming languages support it.
Here's the code to export the list of dictionaries as JSON in Python:
import json
# scraping logic...
# create the pokemon_products.json
json_file = open("pokemon_products.json", "w")
# convert pokemon_products to JSON
# and write it into the JSON output file
json.dump(pokemon_products, json_file)
# release the file resources
json_file.close()
The export logic mentioned above revolves around the json.dump()
function, which comes from Python's standard json
module and allows you to write Python objects into a JSON formatted file.
json.dump() takes two parameters:

The Python object to convert to JSON (here, the pokemon_products list).
The file object to write the JSON data to, created with the open() function.

After completing all the steps mentioned above, navigate to the folder where your code is located: you will find that the CSV and JSON files with your scraped, exported data have been created!
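One optional tweak worth knowing: json.dump() also accepts an indent parameter that pretty-prints the output, making the exported file much easier to inspect by eye. The snippet below demonstrates it with json.dumps (the string-returning sibling of json.dump) and a made-up one-item sample:

```python
import json

pokemon_products = [{"name": "Bulbasaur", "price": "£63.00"}]  # made-up sample

# indent=2 pretty-prints; ensure_ascii=False keeps "£" readable in the output
print(json.dumps(pokemon_products, indent=2, ensure_ascii=False))
```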
Congratulations! With this tutorial, you have learned how to perform web scraping using Python and extract information from web pages.