A dynamic web page is one where the content is not all directly embedded in static HTML but is generated through server-side or client-side rendering.
It can display data in real time in response to user actions, such as loading more content when the user clicks a button or scrolls down the page (infinite scrolling). This design improves the user experience and lets users get relevant information without reloading the entire page.
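When scraping, that scroll can be scripted. Here is a minimal sketch with Playwright that scrolls to the bottom a few times to trigger lazy loading (the URL and the .item selector are placeholders):

const { chromium } = require('playwright');

async function scrollToLoadMore() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/feed'); // placeholder URL
  for (let i = 0; i < 5; i++) {
    // scroll to the bottom to trigger the next batch of content
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForTimeout(1000); // give the new content time to load
  }
  const itemCount = await page.$$eval('.item', (els) => els.length); // placeholder selector
  console.log(`Loaded ${itemCount} items after scrolling.`);
  await browser.close();
}

scrollToLoadMore();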
To determine whether a website is a dynamic web page, you can disable JavaScript in your browser and reload the page. If the site is dynamic, most of the content will disappear.
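The same check can be done programmatically: fetch the raw HTML without running any JavaScript and see whether the content you expect is present. A minimal sketch using Node's built-in fetch (the URL and search text are placeholders):

// Node 18+: fetch is built in, and no JavaScript is executed on the response
async function isContentInStaticHtml(url, text) {
  const res = await fetch(url);
  const html = await res.text(); // raw HTML as the server sent it
  return html.includes(text);
}

isContentInStaticHtml('https://example.com', 'expected headline').then((found) => {
  // if the text is missing from the raw HTML, it is likely rendered client-side
  console.log(found ? 'Content is in static HTML' : 'Content is likely rendered by JavaScript');
});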
Fingerprinting techniques: Many websites use fingerprinting to detect and block automated scrapers. These techniques build a unique "fingerprint" for each visitor from signals such as browser behavior, screen resolution, installed plugins, and time zone. If the fingerprint looks anomalous or inconsistent with regular user behavior, the website may block access.
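As an illustration, here are a few of the signals such a fingerprinting script can read in the browser (run in the page's console; this is only a small, assumed subset of what real scripts collect):

// a few of the signals a fingerprinting script can collect in the browser
const fingerprint = {
  userAgent: navigator.userAgent,
  language: navigator.language,
  screenResolution: `${screen.width}x${screen.height}`,
  timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
  pluginCount: navigator.plugins.length, // often 0 in headless browsers
  hasWebdriver: navigator.webdriver === true, // a classic automation tell
};
console.log(fingerprint);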
Blocking mechanisms: To protect their content, websites implement various blocking mechanisms, such as:
CAPTCHA challenges that must be solved before content is served.
Rate limiting and IP bans for clients that send too many requests too quickly.
Login walls or session checks that hide the full content from anonymous visitors.
An effective method is to intercept the XHR (XMLHttpRequest) and Fetch requests a page makes: inspect the browser's Network tab to identify the API endpoints that deliver the dynamic content. Once these endpoints are identified, an HTTP client such as Python's Requests library can send requests directly to these APIs to obtain the data.
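For example, once an endpoint has been spotted in the Network tab, it can be called directly. A minimal sketch using Node's built-in fetch (the endpoint URL and headers are hypothetical):

// call an API endpoint discovered in the browser's Network tab
async function fetchApiData() {
  const res = await fetch('https://example.com/api/items?page=1', {
    headers: {
      // some endpoints check these headers; copy them from the original request
      'User-Agent': 'Mozilla/5.0',
      'Accept': 'application/json',
    },
  });
  if (!res.ok) throw new Error(`Request failed: ${res.status}`);
  const data = await res.json(); // the dynamic content, already structured
  console.log(data);
}

fetchApiData();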
Using a headless browser like Puppeteer or Selenium allows full simulation of user behavior, including page loading and interaction. These tools can execute JavaScript and therefore scrape dynamically generated content.
Requesting data directly from the website's API is an efficient way to crawl: analyze the website's network requests, find the API endpoint, and use an HTTP client to request the data.
Dynamically loaded data can also be extracted by monitoring network requests, identifying the AJAX calls, and reproducing them.
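With Playwright, this monitoring can itself be automated: listen for responses and keep only XHR and Fetch traffic so the underlying endpoints can be reproduced later. A minimal sketch (the target URL is a placeholder):

const { chromium } = require('playwright');

async function captureAjaxResponses() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  page.on('response', (response) => {
    const type = response.request().resourceType();
    if (type === 'xhr' || type === 'fetch') {
      // log each AJAX/Fetch call so its endpoint can be replayed directly
      console.log(type, response.status(), response.url());
    }
  });
  await page.goto('https://example.com'); // placeholder URL
  await page.waitForTimeout(5000); // let the dynamic requests fire
  await browser.close();
}

captureAjaxResponses();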
To avoid being blocked by the website, using proxy services and IP rotation is an important strategy: spreading requests across multiple IP addresses reduces the risk of detection.
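A minimal sketch of the idea using Playwright's proxy option, picking a random proxy per launch (the proxy pool here is a placeholder):

const { chromium } = require('playwright');

// placeholder proxy pool; replace with real, working proxies
const proxies = [
  { server: 'http://proxy1.example.com:8080', username: 'user', password: 'password' },
  { server: 'http://proxy2.example.com:8080', username: 'user', password: 'password' },
];

async function fetchWithRotatingProxy(url) {
  // pick a proxy at random so requests are spread across IP addresses
  const proxy = proxies[Math.floor(Math.random() * proxies.length)];
  const browser = await chromium.launch({ proxy });
  const page = await browser.newPage();
  await page.goto(url);
  const title = await page.title();
  await browser.close();
  return title;
}

fetchWithRotatingProxy('https://example.com').then(console.log); // placeholder URL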
Writing scripts that simulate human browsing behavior, such as adding delays between requests and randomizing the order of operations, can reduce the risk of being flagged as a crawler.
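For instance, a small random-delay helper plus a shuffled crawl order keeps the timing and sequence from looking machine-regular (the delay bounds and URLs here are arbitrary):

const { chromium } = require('playwright');

// wait a random amount of time between minMs and maxMs
function randomDelay(minMs, maxMs) {
  const ms = minMs + Math.random() * (maxMs - minMs);
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function crawlLikeAHuman(urls) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  // shuffle the URLs so the crawl order is not always the same
  const shuffled = [...urls].sort(() => Math.random() - 0.5);
  for (const url of shuffled) {
    await page.goto(url);
    await randomDelay(2000, 6000); // pause like a human reading the page
  }
  await browser.close();
}

crawlLikeAHuman(['https://example.com/a', 'https://example.com/b']); // placeholder URLs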
Browserless is a headless Chrome cloud service that runs online applications and automated scripts without a graphical user interface. It is especially helpful for jobs like web scraping and other automated operations. Next, we will use Browserless as an example to crawl dynamic web pages.
Before we start, we need a Browserless service. Browserless can take on complex web crawling and large-scale automation tasks, and fully managed cloud deployment is now available. It adopts a browser-centric strategy, delivers robust headless deployment capabilities, and offers higher performance and reliability. You can click here to learn more about configuring the Browserless service.
First, we need to get Nstbrowser's API key. You can find it on the Browserless menu page of the Nstbrowser client, or click here to go there directly.
Before we start, let's define the goal of this test: we will use Puppeteer and Playwright to grab the h1 heading content of a dynamic web page:
Follow the steps below to install dependencies:
npm init -y
pnpm add playwright puppeteer-core
Here is the Playwright version:

const { chromium } = require('playwright');

async function createBrowser() {
  const token = ''; // required
  const config = {
    proxy: '', // required; format: scheme://user:password@host:port, e.g. http://user:password@localhost:8080
    // platform: 'windows', // supported: windows, mac, linux
    // kernel: 'chromium', // only chromium is supported
    // kernelMilestone: '124', // supported: 113, 120, 124
    // args: {
    //   '--proxy-bypass-list': 'detect.nstbrowser.io',
    // }, // browser args
    // fingerprint: {
    //   userAgent:
    //     'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.6613.85 Safari/537.36',
    // },
  };
  const query = new URLSearchParams({
    token: token, // required
    config: JSON.stringify(config),
  });
  const browserWSEndpoint = `ws://less.nstbrowser.io/connect?${query.toString()}`;
  const browser = await chromium.connectOverCDP(browserWSEndpoint);
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto('https://www.nstbrowser.io/en');
  // wait 5 seconds for the dynamic content to render
  await new Promise((resolve) => setTimeout(resolve, 5000));
  const h1Element = await page.$('h1');
  const content = await h1Element?.textContent();
  console.log(`Playwright: The content of the h1 element is: ${content}`);
  await page.close();
  await context.close();
}

createBrowser().then();
And here is the same task with Puppeteer:

const puppeteer = require('puppeteer-core');

async function createBrowser() {
  const token = ''; // required
  const config = {
    proxy: '', // required; format: scheme://user:password@host:port, e.g. http://user:password@localhost:8080
    // platform: 'windows', // supported: windows, mac, linux
    // kernel: 'chromium', // only chromium is supported
    // kernelMilestone: '124', // supported: 113, 120, 124
    // args: {
    //   '--proxy-bypass-list': 'detect.nstbrowser.io',
    // }, // browser args
    // fingerprint: {
    //   userAgent:
    //     'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.6613.85 Safari/537.36',
    // },
  };
  const query = new URLSearchParams({
    token: token, // required
    config: JSON.stringify(config),
  });
  const browserWSEndpoint = `ws://less.nstbrowser.io/connect?${query.toString()}`;
  const browser = await puppeteer.connect({
    browserWSEndpoint: browserWSEndpoint,
    defaultViewport: null,
  });
  const page = await browser.newPage();
  await page.goto('https://www.nstbrowser.io/en');
  // wait 5 seconds for the dynamic content to render
  await new Promise((resolve) => setTimeout(resolve, 5000));
  const h1Element = await page.$('h1');
  if (h1Element) {
    // use page.evaluate to read the text content of the h1 element
    const content = await page.evaluate((el) => el.textContent, h1Element);
    console.log(`Puppeteer: The content of the h1 element is: ${content}`);
  } else {
    console.log('No h1 element found.');
  }
  await page.close();
}

createBrowser().then();
Crawling dynamic web pages is always more involved than crawling static ones, and it is easy to run into trouble along the way. Through this blog's walkthrough, you should now have learned: