The internet is a vast source of information for anyone who knows how to extract data from it. As a result, the demand for web scraping has grown exponentially in recent years, with Python becoming the most popular programming language for the job.
In this tutorial, you will learn what web scraping is, build a real Python scraper step by step, and export the scraped data to CSV and JSON.
Let's dive into the world of web scraping using Python!
Web scraping is the process of retrieving data from the web. Even copying and pasting content from a page is a form of scraping! However, the term typically refers to tasks automated by software, essentially scripts (also known as bots, spiders, or crawlers) that access websites and extract desired data from their pages. In our example, we'll use Python.
It's worth noting that many websites implement anti-scraping techniques for various reasons. But don't worry, as we'll show you how to bypass them later!
Are you ready to get started? You'll build a real spider to retrieve data from ScrapeMe, a Pokémon e-commerce website built for learning web scraping.
What you'll see is just an example, aimed at understanding how web scraping works in Python. However, remember that you can apply what you learn here to any other website. The process might be more complex, but the key concepts to follow remain the same.
Before diving into writing some code, you need to meet some prerequisites:
To build a data scraper using Python, you'll need to download and install the following tools:
Python 3.11+: This tutorial refers to Python 3.11.2, the latest version at the time of writing.
pip: Python's package installer, which lets you install libraries from the Python Package Index (PyPI) with a single command.
IDE: Any IDE that supports Python can be used.
Note: If you're a Windows user, don't forget to select the Add python.exe to PATH option during the installation wizard. This enables Windows to use the python and pip commands in the terminal. Note that pip is included by default with Python 3.4 or later versions, so you don't need to install it manually.
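To confirm that both tools are available on your PATH, you can run the following commands (on Windows the interpreter is usually invoked as python rather than python3):

```shell
# print the interpreter version to confirm the installation
python3 --version

# pip ships with Python 3.4+, so this should work out of the box
python3 -m pip --version
```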
Now you have everything you need to build your first web scraper with Python. Let's move on!
Create a folder and inside it create a file named main.py. Open this file in your IDE.
Write the following code in the file:
def print_hello(name):
    print(f"Hello, {name}!")

if __name__ == "__main__":
    print_hello('world')
Execute the following command in the terminal to verify:
python3 main.py
If you see the following output in the terminal, it means you have successfully run Python:
Hello, world!
You might be eager to start coding right away, but that's not the best approach. First, you need to spend some time understanding your target website. This might sound tedious or unnecessary, but it's the only way to study the site's structure and figure out how to scrape data from it. Every scraping project begins this way.
Browse the website, try out the search functionality, click some buttons, and observe the website's responses and page structure.
Web servers return an HTML document based on the requested URL, with each document associated with a specific page. Consider the URL for the fourth page of the product list:
https://scrapeme.live/shop/page/4/
You can break any of them down into two main parts:

Base URL: the path to the shop section of the website (here, https://scrapeme.live/shop/).
Specific page location: the path to the individual page (here, page/4/). It may end with an extension such as .html or .php, or have no extension at all.

All the products offered on the website share the same base URL. What varies between pages is the latter half, which contains a string specifying which page the server should return. Typically, URLs for pages of the same type share a similar format.
Furthermore, URLs can also contain additional information:

Path parameters: values embedded in the path itself (e.g., in https://scrapeme.live/shop/page/4/, the path parameter is 4).
Query parameters: values appended after a question mark (?) at the end of the URL. They often encode filter values to be sent to the server when performing a search (e.g., in https://www.example.com/search?search=blabla&sort=newest, search=blabla and sort=newest are query parameters).

It's important to note that any query string consists of:

?: marks the beginning of the query string.
key=value pairs separated by &: where key is the name of a parameter and value represents its value.

In other words, URLs are not just simple location strings for HTML documents. They can carry parameter information that servers use to run queries and populate pages with specific data.
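This anatomy can be verified with Python's standard urllib.parse module, which splits a URL into exactly these components (the URL below is the example from the text):

```python
from urllib.parse import urlparse, parse_qs

# split an example search URL into its components
url = "https://www.example.com/search?search=blabla&sort=newest"
parts = urlparse(url)

print(parts.netloc)           # www.example.com
print(parts.path)             # /search
print(parse_qs(parts.query))  # {'search': ['blabla'], 'sort': ['newest']}
```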
You're now familiar with the website. The next step is to delve into the HTML code of the pages, studying their structure and content to understand how to extract data from them.
All modern browsers come with a set of advanced developer tools, most of which offer nearly identical functionalities. These tools allow you to explore the HTML code of web pages and work with it. In this Python web scraping tutorial, you'll see the practical application of Chrome's DevTools.
Right-click on an HTML element and choose Inspect
to open the DevTools window. If the website has disabled the right-click menu, you can press F12 or use the Ctrl+Shift+I shortcut (Cmd+Option+I on macOS) instead.
DevTools allow you to inspect the structure of the webpage's Document Object Model (DOM), which in turn helps you understand the source code more deeply. In the DevTools window, navigate to the Elements tab to access the DOM.
However, web scraping faces numerous challenges in today's advanced online landscape. Websites often deploy anti-bot measures such as IP blocking, rate limiting, CAPTCHAs, and browser fingerprinting to keep automated visitors out.
Nstbrowser has been regarded as one of the solutions to these challenges. With Nstbrowser, you can overcome these obstacles effortlessly. It employs browser emulation techniques to mimic human behavior, reducing the risk of triggering automated protection mechanisms. User-Agent rotation ensures that your scraping activities remain undetected.
Now all the preparations are finished! The detailed steps follow:
Get ready to fire up your Python, as you are all set to write some code.
Assuming you want to scrape some data from the following location:
https://scrapeme.live/shop/
To retrieve the HTML code of the target page, you first need to download the HTML document associated with the page URL. To achieve this, you can use the Python requests library.
To install the requests library, use the following command:
pip install requests
Create a new scraper.py
file and add the following code to the file:
import requests
# download the HTML document with an HTTP GET request
response = requests.get("https://scrapeme.live/shop/")
# print the HTML content of the page
print(response.text)
This code imports the requests
library and then uses it to make a GET request to the URL of the target page. It returns the response content, which includes the HTML document.
By printing the text
attribute of the response, you can see the structure of the target page's code:
<!doctype html>
<html lang="en-GB">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=2.0">
<link rel="profile" href="http://gmpg.org/xfn/11">
<link rel="pingback" href="https://scrapeme.live/xmlrpc.php">
<title>Products – ScrapeMe</title>
<!-- rest of the page omitted for brevity... -->
Common mistake: Forgetting to handle exceptions.
A GET request can fail for various reasons, such as the server being temporarily unavailable, an incorrect URL, or your IP being blocked. Therefore, it is important to handle errors in the following way:
import requests
# download the HTML document with an HTTP GET request
response = requests.get("https://scrapeme.live/shop/")
if response.ok:
    # scraping logic here
    pass
else:
    # log the error response
    # in case of 4xx or 5xx
    print(response)
Thanks to this error-handling logic, your script will not crash if something goes wrong during the request. It will proceed with the scraping logic only if the response status code is in the 2xx range.
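If you prefer exceptions over manual status checks, an equivalent pattern uses raise_for_status(), which throws an HTTPError for any 4xx or 5xx response. This is a sketch of the same logic, not a replacement for the tutorial's code:

```python
import requests

try:
    response = requests.get("https://scrapeme.live/shop/", timeout=10)
    # raises requests.HTTPError on any 4xx or 5xx status code
    response.raise_for_status()
    html = response.text
except requests.RequestException as error:
    # covers connection errors, timeouts, and bad status codes alike
    print(f"Request failed: {error}")
```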
In the previous step, you retrieved the HTML document from the server. If you look at it, you'll see one long string of markup; the only way to make sense of it is to extract the required data through HTML parsing.
Beautiful Soup is a Python library for parsing XML and HTML content, and it exposes an API for exploring HTML code. In other words, it allows you to select HTML elements and easily extract data from them.
To install the library, type and execute the following command in the terminal:
pip install beautifulsoup4
First request the page with requests as before, then feed the retrieved content to Beautiful Soup for parsing:
import requests
from bs4 import BeautifulSoup
# download the HTML document with an HTTP GET request
response = requests.get("https://scrapeme.live/shop/")
if response.ok:
    # scraping logic here
    soup = BeautifulSoup(response.text, 'html.parser')
else:
    # log the error response
    # in case of 4xx or 5xx
    print(response)
It's important to note that websites contain data in various formats. Single elements, lists, and tables are just a few examples. To make your Python scraper effective, you need to know how to use Beautiful Soup in many scenarios. Let's take a look at how to tackle the most common challenges!
Beautiful Soup provides several methods to select HTML elements from the DOM, and the id
attribute is the most efficient way to select a single element. As the name suggests, the id
uniquely identifies an HTML node on the page.
To find the search box, right-click it and open the DevTools to view its id:
As you can see, the <input>
element has the following id:
woocommerce-product-search-field-0
You can use this information to select the product search element:
product_search_element = soup.find(id="woocommerce-product-search-field-0")
The find()
function allows you to extract a single HTML element from the DOM.
Please note that the id
is an optional attribute. That's why there are other methods to select elements:
By tag: Use find()
without any parameters:
h1_element = soup.find("h1")
By class: Add the class_
parameter:
search_input_element = soup.find(class_="search_field")
By attribute: Use the attrs
parameter:
search_input_element = soup.find(attrs={"name": "s"})
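The three approaches behave as follows on a tiny standalone snippet (the HTML below is invented for illustration; it is not from the ScrapeMe page):

```python
from bs4 import BeautifulSoup

# a minimal, made-up document to demonstrate each selector
html = """
<h1>Shop</h1>
<input id="search-box" class="search_field" name="s">
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find("h1").get_text())              # Shop
print(soup.find(class_="search_field")["id"])  # search-box
print(soup.find(attrs={"name": "s"})["id"])    # search-box
```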
Now that you have learned how to navigate the page and extract information from elements, let's dive into web scraping for real!
Web pages often contain lists of elements, such as product listings in an e-commerce store. Retrieving data from them can be time-consuming, but this is where Python's Beautiful Soup web scraping comes into play!
The Pokémon product list is contained within <li>
elements:
Let's get them:
product_elements = soup.select("li.product")
Iterate through them and extract the product data as follows:
for product_element in product_elements:
    product_name = product_element.find("h2").get_text()
    product_url = product_element.find("a")["href"]
    product_image = product_element.find("img")["src"]
    product_price = product_element.select_one(".amount").get_text()
When handling a list of elements, it is recommended to store the scraped data in a list of dictionaries. In Python, a dictionary is a collection of key-value pairs (insertion-ordered since Python 3.7), and you can use it as follows:
# the list of dictionaries containing the
# scraped data
pokemon_products = []

for product_element in product_elements:
    product_name = product_element.find("h2").get_text()
    product_url = product_element.find("a")["href"]
    product_image = product_element.find("img")["src"]
    product_price = product_element.select_one(".amount").get_text()

    # define a dictionary with the scraped data
    new_pokemon_product = {
        "name": product_name,
        "url": product_url,
        "image": product_image,
        "price": product_price
    }

    # add the new product dictionary to the list
    pokemon_products.append(new_pokemon_product)

# print the list of dictionaries
print(pokemon_products)
You now have a list called pokemon_products
that contains all the information scraped from each individual product on the page.
Fantastic! You now have all the building blocks you need to build a data scraper using Beautiful Soup in Python. But let's keep moving forward; the tutorial is not over yet!
Retrieving web page content is often the first step in a larger process. The next step is to use the scraped information for different needs and purposes. Therefore, it is crucial to convert it into a format that is easy to read and explore, such as CSV or JSON.
The product information now sits in the pokemon_products
list built earlier. Let's learn how to convert it into a new format and export it to a file in Python!
CSV is a popular format for data exchange, storage, and analysis, especially when dealing with large datasets. CSV files store information in a tabular form, with values separated by commas. This makes it compatible with spreadsheet programs like Microsoft Excel.
Here's how you can convert a list of dictionaries into a CSV file in Python:
import csv
# scraping logic...
# write the scraped data to a CSV file
csv_file = open("pokemon_products.csv", "w", encoding="utf-8", newline="")
# create a CSV writer object
writer = csv.writer(csv_file)
# convert each element of the list to a row in the CSV file
for pokemon_product in pokemon_products:
    writer.writerow(pokemon_product.values())
# release the resources
csv_file.close()
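Since the scraped items are already dictionaries, csv.DictWriter is a handy alternative: it writes a header row for you and maps dictionary keys to columns. The sketch below uses a couple of made-up sample products in the same shape as the tutorial's data:

```python
import csv

# made-up sample records matching the tutorial's dictionary shape
pokemon_products = [
    {"name": "Bulbasaur", "url": "https://scrapeme.live/shop/Bulbasaur/",
     "image": "https://scrapeme.live/wp-content/img1.png", "price": "£63.00"},
    {"name": "Ivysaur", "url": "https://scrapeme.live/shop/Ivysaur/",
     "image": "https://scrapeme.live/wp-content/img2.png", "price": "£87.00"},
]

# the with statement closes the file automatically
with open("pokemon_products.csv", "w", encoding="utf-8", newline="") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["name", "url", "image", "price"])
    writer.writeheader()  # column names as the first row
    writer.writerows(pokemon_products)
```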
JSON is a lightweight, versatile, and popular data interchange format, especially in web applications. It is commonly used to transfer information between servers or between clients and servers through APIs. Many programming languages support it.
Here's the code to export the list of dictionaries as JSON in Python:
import json
# scraping logic...
# create the pokemon_products.json
json_file = open("pokemon_products.json", "w")
# convert pokemon_products to JSON
# and write it into the JSON output file
json.dump(pokemon_products, json_file)
# release the file resources
json_file.close()
The export logic mentioned above revolves around the json.dump()
function, which comes from Python's standard json
module and allows you to write Python objects into a JSON formatted file.
json.dump() takes two parameters:

The Python object to convert to JSON (here, the pokemon_products list).
The file object to write the JSON data to, created with the open() function.

After completing all the steps mentioned above, navigate to the folder where your code is located: you will find that the CSV and JSON files with your scraped, exported data have been created!
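One optional tweak worth knowing: json.dump() also accepts an indent parameter that pretty-prints the output, making the exported file much easier to inspect by eye. The snippet below demonstrates it with json.dumps (the string-returning sibling of json.dump) and a made-up one-item sample:

```python
import json

pokemon_products = [{"name": "Bulbasaur", "price": "£63.00"}]  # made-up sample

# indent=2 pretty-prints; ensure_ascii=False keeps "£" readable in the output
print(json.dumps(pokemon_products, indent=2, ensure_ascii=False))
```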
Congratulations! With this tutorial, you have learned how to perform web scraping using Python and extract information from web pages.