Crawlers and bot programs can make a large number of requests in a short period of time, consuming server resources, degrading website performance, or even crashing the site.
Anti-bot systems can help websites manage and limit these requests, thus maintaining the stability and availability of the website.
Some bot programs crawl content on websites for unauthorized use, such as content theft and data scraping. Anti-bot systems can help protect a website's data and content from unauthorized access and misuse.
Malicious bot programs can be used for a variety of attacks, such as Distributed Denial of Service (DDoS) attacks, brute force password cracking, etc.
An anti-bot system can help identify and block these malicious behaviors, improving the overall security of a website.
Some bot programs may try to obtain users' personal information, such as email addresses and contact details.
Anti-bot systems can help protect user privacy and prevent this information from being illegally collected and misused.
When bot programs access a website in large numbers, they may degrade the speed and experience for normal users.
By limiting bot traffic, websites can ensure a better experience for real users.
Some bot programs simulate users clicking on ads to commit ad fraud, resulting in losses for advertisers.
Anti-bot detection can identify and block these fake clicks, protecting advertisers' interests.
Anti-bot systems identify and block bot traffic through a variety of techniques. Here we focus on six common analysis methods:
Use machine learning algorithms to analyze and identify behavioral differences between normal users and bots. Machine learning models can continuously learn and adapt to new bot behaviors.
Websites can check for specific JavaScript variables on a page that are commonly associated with the use of Puppeteer.
For example, they may look for variable names that contain "puppeteer" or other relevant identifiers.
for (let key in window) {
if (key.includes('puppeteer') || key.includes('webdriver')) {
// Detected Puppeteer
}
}
Puppeteer also modifies browser behavior to automate tasks. As a result, sites may check the presence and value of a property like navigator.webdriver, or other automation indicator flags, to determine whether an automation tool is controlling the browser. In Puppeteer, this property is typically set to true.
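For illustration, here is a minimal sketch of the kind of client-side check a site might embed in its pages; reportSuspiciousClient is a made-up example handler, not a real API:

// Hypothetical detection script a site might run in the browser
if (navigator.webdriver === true) {
  // reportSuspiciousClient is a placeholder for whatever the site does next
  reportSuspiciousClient('navigator.webdriver flag indicates automation');
}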
Want to bypass bot detection quickly? Start using Nstbrowser for free now!
Nstbrowser not only uses real browser fingerprints for web access but also simulates the behavior and habits of real users, making it difficult for anti-bot systems to recognize.
In addition, to simplify web scraping and automation, Nstbrowser is equipped with powerful website unblocker technology to provide a seamless web access experience.
As mentioned above, bot detection has become a major problem for web crawler programs. But don't worry! We can still solve it easily.
Besides using Nstbrowser, here are some techniques you can use to avoid bots with Puppeteer:
Most bot detectors work primarily by examining IP addresses. Web servers can derive patterns from IP addresses by keeping a log of each request.
They use Web Application Firewalls (WAFs) to track and block IP address activity and to blacklist suspicious IPs. Repeated, programmatic requests to the server damage an IP's reputation and can lead to a permanent block.
To avoid detection, you can set up a proxy with IP rotation in Puppeteer:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({
args: [
'--proxy-server=http://your_proxy_ip:your_proxy_port',
// Add any other Chrome flags you need
],
});
const page = await browser.newPage();
// Now Puppeteer will use the proxy specified above
await page.goto('https://example.com');
// Continue with your automation tasks
await browser.close();
})();
The --proxy-server=http://your_proxy_ip:your_proxy_port argument specifies the address and port of the proxy server. Make sure to replace your_proxy_ip and your_proxy_port with the IP address and port number of the actual proxy server you are using.
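To go a step further, you can rotate through a pool of proxies and launch a fresh browser instance with a different one each run. A minimal sketch, assuming the proxy URLs below are placeholders you replace with your own:

const puppeteer = require('puppeteer');

// Placeholder proxy pool - replace with your own proxy endpoints
const proxies = [
  'http://proxy1.example.com:8000',
  'http://proxy2.example.com:8000',
];

(async () => {
  // Pick a random proxy for this session
  const proxy = proxies[Math.floor(Math.random() * proxies.length)];

  const browser = await puppeteer.launch({
    args: [`--proxy-server=${proxy}`],
  });

  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();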
Websites typically check the User-Agent of a request to determine which browser and operating system the request comes from.
By default, Puppeteer uses a fixed User-Agent, which makes it easy to detect. By randomizing the User-Agent, your requests are more likely to be treated as coming from different real users.
In addition, anti-bot systems also inspect HTTP headers to identify bots, including Accept-Language, Accept-Encoding, Cache-Control, and so on.
The default HTTP headers can also expose the use of automation tools. Randomizing the User-Agent and setting common HTTP headers will help your requests look authentic.
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const randomUseragent = require('random-useragent'); // Random User-Agent Library
puppeteer.use(StealthPlugin());
(async () => {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
// Define common HTTP headers
const commonHeaders = {
'Accept-Language': 'en-US,en;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
'Cache-Control': 'no-cache',
'Upgrade-Insecure-Requests': '1',
};
// Randomize User-Agent and HTTP headers
const setRandomHeaders = async (page) => {
const userAgent = randomUseragent.getRandom(); // Get random User-Agent
await page.setUserAgent(userAgent);
await page.setExtraHTTPHeaders(commonHeaders);
};
await setRandomHeaders(page);
await page.evaluateOnNewDocument(() => {
Object.defineProperty(navigator, 'webdriver', {
get: () => false,
});
});
await page.goto('https://example.com', {
  waitUntil: 'networkidle2',
});

await browser.close();
})();
By default, Puppeteer sets the navigator.webdriver property to true, which exposes the presence of automation tools. By disabling or modifying this property, you can reduce the chances of being detected.
await page.evaluateOnNewDocument(() => {
Object.defineProperty(navigator, 'webdriver', { get: () => false });
});
Using the puppeteer-extra-plugin-stealth plugin can help Puppeteer avoid being detected as a bot. This plugin modifies some of the browser's default behavior and characteristics to make it look like a real user.
First, install the puppeteer-extra and puppeteer-extra-plugin-stealth packages:
npm install puppeteer-extra puppeteer-extra-plugin-stealth
Next, you can use these plugins in your code to launch Puppeteer:
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
(async () => {
  const browser = await puppeteer.launch({ headless: true });
  // ... run your automation tasks here, then close the browser
  await browser.close();
})();
Repeated logins are usually required if you want to scrape data from social media platforms or other sites that require authentication.
These repeated authentication requests can trigger an alert, and the account may be blocked or challenged with a CAPTCHA or JavaScript check.
We can avoid this by using cookies: after logging in once, collect the session cookies and reuse them in future runs, as in the sketch below.
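A minimal sketch of saving cookies after a login and restoring them later; cookies.json, the example.com URLs, and the login steps are placeholders, and the two phases are compressed into one script for illustration:

const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // First run: log in, then save the session cookies to disk
  await page.goto('https://example.com/login');
  // ... perform the login steps here ...
  const cookies = await page.cookies();
  fs.writeFileSync('cookies.json', JSON.stringify(cookies, null, 2));

  // Later runs: restore the saved cookies before visiting protected pages
  const savedCookies = JSON.parse(fs.readFileSync('cookies.json', 'utf8'));
  await page.setCookie(...savedCookies);
  await page.goto('https://example.com/dashboard');

  await browser.close();
})();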
During web scraping, you will inevitably run into CAPTCHAs. In that case, you can turn to a CAPTCHA-solving service.
Typically, these services use real users to solve CAPTCHAs, which reduces the likelihood of being detected as a bot.
This helps you bypass bot detection and can also reduce the overall cost of running a bot.
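For illustration, a hedged sketch of the general integration pattern: solveCaptcha and the solver endpoint below are hypothetical stand-ins for whatever service you choose, and the snippet assumes Node 18+ (built-in fetch) and a reCAPTCHA v2-style form:

// Hypothetical helper that sends the CAPTCHA's sitekey and page URL to a
// solving service and resolves with the token once it has been solved.
// The endpoint and request format are placeholders, not a real API.
async function solveCaptcha(siteKey, pageUrl) {
  const res = await fetch('https://captcha-solver.example.com/solve', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ siteKey, pageUrl }),
  });
  const { token } = await res.json();
  return token;
}

// Inject the returned token into the page so the form can be submitted
const token = await solveCaptcha('your_site_key', 'https://example.com');
await page.evaluate((t) => {
  document.querySelector('#g-recaptcha-response').value = t;
}, token);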
Nstbrowser easily avoids bot detection with a powerful CAPTCHA Solver.
Start to Use for Free Now!
Do you have any ideas or questions about web scraping and Browserless?
See what other developers are sharing on Discord and Telegram!
Real users can't make 500 requests in a minute!
Nor do real users follow fixed browsing patterns and scripts!
So, to avoid being easily detected by anti-bot systems, add delayed input and some randomized actions to your Puppeteer automation. This way it mimics a real user and reduces the risk of detection to some extent.
await page.type('input[name=username]', 'myUsername', { delay: 100 });
await page.type('input[name=password]', 'myPassword', { delay: 100 });
await page.mouse.move(100, 100);
await page.mouse.click(100, 100);
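To go beyond fixed delays and coordinates, you can also randomize the pauses and mouse positions themselves. A small sketch, assuming a plain Promise-based sleep helper:

// Simple helper: wait a random time between min and max milliseconds
const randomDelay = (min, max) =>
  new Promise((resolve) => setTimeout(resolve, min + Math.random() * (max - min)));

// Type with a slightly different keystroke delay on each run
await page.type('input[name=username]', 'myUsername', { delay: 80 + Math.random() * 80 });

// Pause unpredictably, then move and click at a jittered position
await randomDelay(500, 1500);
const x = 100 + Math.random() * 20;
const y = 100 + Math.random() * 20;
await page.mouse.move(x, y, { steps: 10 });
await page.mouse.click(x, y);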
When running automation tasks with Puppeteer, it is sometimes possible to utilize browser extensions to help bypass some of the bot detection.
These extensions can modify the behavior of the browser to make it appear more like it is being operated by a real user.
Load local extensions:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({
headless: false, // non-headless mode
args: [
`--disable-extensions-except=/path/to/extension/`, // Load extensions with specified paths
`--load-extension=/path/to/extension/`
]
});
const page = await browser.newPage();
await page.goto('https://example.com');
// Continue executing your code
})();
Change the default Chrome extension path
By default, Puppeteer emulates Chrome with an empty extensions directory. You can point it at a custom user data directory by setting userDataDir and preload the required extensions in that profile.
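A minimal sketch, assuming the profile at ./my-chrome-profile already has the extension installed:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    // Reuse an existing Chrome profile that already contains the extension
    userDataDir: './my-chrome-profile',
  });

  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();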
In this article, we discussed why websites deploy anti-bot systems, how they detect automation tools like Puppeteer, and several practical techniques for bypassing bot detection.
Nstbrowser's RPA solution is one of the best options available for avoiding bot detection, and you can configure and use it completely free of charge.