Suppose today's task is to scrape pricing information from competitors' websites. How would you do it? Copying and pasting? Entering data by hand? Certainly not! Those approaches consume most of your time and invite errors.
It's worth pointing out that Python has become one of the most popular programming languages for data scraping. So what makes it so appealing?
Let's dive into the world of web scraping with Python!
Web scraping is the process of extracting data from websites. It can be done manually (even copying and pasting from a page is a form of web scraping), but automated tools or scripts are far better for collecting large amounts of data efficiently and accurately.
Python is regarded as one of the best choices for web scraping for several reasons: its simple, readable syntax, a rich ecosystem of libraries such as Requests and Selenium, and a large, active community.
Java is also an important language for web scraping; you can learn three useful methods in the Java Web Scraping tutorial.
Are you ready to start your journey into web scraping with Python? Before diving into the essential steps, make sure you know what to expect and how to proceed.
Web scraping involves a systematic process comprising four main tasks:
1. Inspecting the Target Pages
Before extracting data, you need to understand the website's layout and data structure:
2. Retrieving HTML Content
To scrape a website, you first need to access its HTML content:
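As a sketch, this step might look like the following with the requests library (the URL and User-Agent below are placeholders, not part of the original tutorial):

```python
from typing import Optional

import requests
from requests.exceptions import RequestException


def fetch_html(url: str, timeout: float = 10.0) -> Optional[str]:
    """Fetch a page's raw HTML, returning None on any request failure."""
    headers = {
        # A browser-like User-Agent; some sites reject the library default.
        "User-Agent": "Mozilla/5.0 (compatible; example-scraper/1.0)"
    }
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # raise HTTPError for 4xx/5xx statuses
        return response.text
    except RequestException:
        # Network errors, timeouts, and bad statuses all end up here.
        return None
```

Wrapping the call in a `try/except` keeps a single failed page from crashing a long scraping run.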
3. Extracting Data from HTML
Once you have the HTML, the next step is to extract the desired information:
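As a minimal, standard-library-only illustration of this step (the HTML snippet and the `TitleExtractor` class are invented for demonstration; real projects typically reach for BeautifulSoup or Selenium selectors instead):

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text of every <h2> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.titles.append(data.strip())


html = "<h1>Shop</h1><h2>Basic Plan</h2><p>$9</p><h2>Pro Plan</h2><p>$29</p>"
parser = TitleExtractor()
parser.feed(html)
print(parser.titles)  # ['Basic Plan', 'Pro Plan']
```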
4. Storing Extracted Data
After extracting the data, it’s crucial to store it in an accessible format:
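For instance, scraped records can be stored as both JSON and CSV using the standard library (the records and file names below are made-up placeholders):

```python
import csv
import json

records = [
    {"title": "Example Product", "price": "$9.99"},
    {"title": "Another Product", "price": "$19.99"},
]

# JSON preserves nested structure, convenient for further processing.
with open("products.json", "w") as f:
    json.dump(records, f, indent=2)

# CSV is convenient for spreadsheets and quick inspection.
with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(records)
```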
Tip: Websites are dynamic, so regularly review and update your scraping process to keep the data current.
Python web scraping can be applied in a wide range of scenarios, from price monitoring (as in our opening example) to market research and content aggregation.
Web scraping comes with its own set of challenges. Websites employ anti-bot measures such as IP blocking, JavaScript challenges, and CAPTCHAs, which can be circumvented with techniques like rotating proxies and headless browsers.
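The rotating-proxy idea can be sketched with requests-style proxy dictionaries (the proxy URLs below are placeholders; substitute proxies you actually control):

```python
from itertools import cycle

# Hypothetical proxy endpoints, for illustration only.
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
proxy_pool = cycle(PROXIES)


def next_proxy_config() -> dict:
    """Return a requests-style proxies dict, rotating on each call."""
    proxy = next(proxy_pool)
    return {"http": proxy, "https": proxy}


# Each request would then go out through a different proxy, e.g.:
# requests.get(url, proxies=next_proxy_config())
first = next_proxy_config()
second = next_proxy_config()
```

Rotating the exit IP this way makes simple per-IP rate limits and blocklists much less effective against the scraper.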
While web scraping is versatile, there are alternatives, such as official APIs and ready-made datasets. Despite these options, web scraping remains a popular choice due to its flexibility and comprehensive data access.
Embark on your web scraping journey with Python, and unlock the vast potential of online data!
First, we need to install the required packages from the shell (json ships with Python, so only selenium and requests need installing):
pip install selenium requests
After the installation is complete, create a new scraping.py file and import the libraries we just installed:
import json
from urllib.parse import quote
from urllib.parse import urlencode
import requests
from requests.exceptions import HTTPError
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
For a concrete demonstration, we will use Nstbrowser, a free anti-detect browser, as the tool for our task:
def create_and_connect_to_browser():
    host = '127.0.0.1'
    api_key = 'xxxxxxx'  # your API key
    config = {
        'once': True,
        'headless': False,  # headless mode
        'autoClose': True,
        'remoteDebuggingPort': 9226,
        'userAgent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'fingerprint': {  # required
            'name': 'custom browser',
            'platform': 'windows',  # supported: windows, mac, linux
            'kernel': 'chromium',  # only chromium is supported
            'kernelMilestone': '120',
            'hardwareConcurrency': 4,  # supported: 2, 4, 8, 10, 12, 14, 16
            'deviceMemory': 4,  # supported: 2, 4, 8
            'proxy': '',  # format: schema://user:password@host:port, e.g. http://user:password@localhost:8080
        }
    }
    query = urlencode({
        'x-api-key': api_key,  # required
        'config': quote(json.dumps(config))
    })
    url = f'http://{host}:8848/devtool/launch?{query}'
    print('devtool url: ' + url)
    # get_debugger_port() is a helper that calls the launch URL and reads
    # the remote debugging port from the response
    port = get_debugger_port(url)
    debugger_address = f'{host}:{port}'
    print('debugger_address: ' + debugger_address)
    return debugger_address
After launching Nstbrowser, we connect Selenium to it via the debugger address it returned:
def exec_selenium(debugger_address: str):
    options = webdriver.ChromeOptions()
    options.add_experimental_option("debuggerAddress", debugger_address)
    # Replace with the path to the matching version of ChromeDriver.
    chrome_driver_path = r'./chromedriver'  # your ChromeDriver path
    service = ChromeService(executable_path=chrome_driver_path)
    driver = webdriver.Chrome(service=service, options=options)
We have now successfully started Nstbrowser via Selenium. Let's begin crawling!
driver.get("https://www.imdb.com/chart/top")
Run the script:
python scraping.py
As you can see, we successfully launched Nstbrowser and visited our target site.
We can use Selenium to locate this DOM structure and analyze its content:
movies = driver.find_elements(By.CSS_SELECTOR, "li.cli-parent")
for row in movies:
    title = row.find_element(By.CLASS_NAME, 'ipc-title-link-wrapper')  # get the title
    year = row.find_element(By.CSS_SELECTOR, 'span.cli-title-metadata-item')  # get the release year
    rate = row.find_element(By.CLASS_NAME, 'ipc-rating-star')  # get the rating
    movie_item = {
        "title": title.text,
        "year": year.text,
        "rate": rate.text
    }
    print(movie_item)
Of course, outputting this information in the terminal is not our goal. Next, we need to save the data we crawled.
We use the JSON library to save the retrieved data to a JSON file:
movies = driver.find_elements(By.CSS_SELECTOR, "li.cli-parent")
movies_info = []
for row in movies:
    title = row.find_element(By.CLASS_NAME, 'ipc-title-link-wrapper')
    year = row.find_element(By.CSS_SELECTOR, 'span.cli-title-metadata-item')
    rate = row.find_element(By.CLASS_NAME, 'ipc-rating-star')
    movie_item = {
        "title": title.text,
        "year": year.text,
        "rate": rate.text
    }
    movies_info.append(movie_item)

# write movies_info to a JSON file; the with statement
# releases the file resources automatically
with open("movies.json", "w") as json_file:
    json.dump(movies_info, json_file)
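To sanity-check the result, you can load the file back with the same json library (a small optional helper; the path matches the movies.json file created above):

```python
import json


def load_movies(path: str) -> list:
    """Load the scraped movie records back from the JSON file."""
    with open(path) as f:
        return json.load(f)


# After the scraper has run:
# movies = load_movies("movies.json")
# print(len(movies), "movies loaded")
```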
How do you do web scraping with Python and Selenium? This tutorial covered everything you need: the concept of web scraping, the advantages of Python for the job, and the concrete steps, using the free anti-detect browser Nstbrowser as an example. I'm sure you have learned a lot about Python web scraping by now! Time to run your own project and start collecting data.