How to Avoid Bot Detection with Puppeteer?
Anti-bot detection is really annoying! Can Puppeteer get around it? Here are 8 methods that work.
Jul 05, 2024 · Carlos Rivera

Why Are There Anti-Bots?

  • Protecting website resources and performance

Crawlers and bot programs can make a large number of requests in a short period of time, consuming server resources and causing website performance to degrade or even crashing the website.

Anti-bot systems can help websites manage and limit these requests, thus maintaining the stability and availability of the website.

  • Prevent data theft and misuse

Some bot programs crawl content on websites for unauthorized use, such as content theft and data scraping. Anti-bots can help protect data and content on websites from unauthorized access and misuse.

  • Improved security

Malicious bot programs can be used for a variety of attacks, such as Distributed Denial of Service (DDoS) attacks, brute force password cracking, etc.

An anti-bot system can help identify and block these malicious behaviors, improving the overall security of a website.

  • Protecting user privacy

Some bot programs may try to obtain user's personal information, such as email addresses, contact information, etc.

Anti-robot systems can help protect user privacy and prevent this information from being illegally collected and misused.

  • Improve user experience

When a robot program accesses a website in large numbers, it may affect the speed and experience of normal users.

By limiting bot traffic, websites can ensure a better experience for real users.

  • Prevent ad fraud

Some bot programs simulate users clicking on ads to commit ad fraud, resulting in losses for advertisers.

Anti-bot detection can identify and block these fake clicks, protecting advertisers' interests.

How Do Anti-Bots Work?

Anti-bot systems identify and block bot traffic through a variety of techniques. Here are 6 common analysis methods:

1. Behavioral analysis

  • Monitors the user's behavioral patterns on the website, such as mouse movements, clicks, scrolling, and keyboard inputs. Bot programs are usually unable to simulate natural human behavior (a simplified sketch follows this list).
  • Analyzes the speed and frequency of user requests. Bots typically send requests at non-human speeds, such as a large number of requests per second.
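
As a rough illustration (not any vendor's actual code), a page could count pointer and keyboard events and flag sessions that show no human-like activity at all. The /bot-signal endpoint below is a hypothetical reporting URL:

JavaScript
// Very simplified client-side behavioral check
let humanSignals = 0;
document.addEventListener('mousemove', () => humanSignals++);
document.addEventListener('scroll', () => humanSignals++);
document.addEventListener('keydown', () => humanSignals++);

document.querySelector('form')?.addEventListener('submit', () => {
  if (humanSignals === 0) {
    // No mouse, scroll, or keyboard activity at all: likely a bot
    navigator.sendBeacon('/bot-signal', JSON.stringify({ humanSignals }));
  }
});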

2. Device and environment detection

  • Collects information about the user's browser fingerprint, including browser type, version, operating system, plug-ins, etc. The browser fingerprint of a bot program is usually different from that of a real user.
  • Checks the User-Agent field in the request header. Many bot programs use default or abnormal User-Agent values (a server-side sketch follows this list).
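
For example, a server could reject requests whose User-Agent is missing or looks automated. A minimal Node.js sketch, with illustrative header patterns only:

JavaScript
const http = require('http');

http.createServer((req, res) => {
  const ua = req.headers['user-agent'] || '';

  // Reject missing or obviously automated User-Agent strings
  if (!ua || /HeadlessChrome|bot|spider|curl/i.test(ua)) {
    res.writeHead(403);
    return res.end('Forbidden');
  }

  res.writeHead(200);
  res.end('Welcome');
}).listen(3000);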

3. Challenge-response mechanisms

  • Uses CAPTCHA or reCAPTCHA to require users to perform certain tasks (e.g., recognizing objects in pictures) to verify that they are human.
  • Inserts hidden fields or links into web pages (honeypots). Real users never interact with these elements, so a bot that triggers one of these traps reveals itself (a simplified sketch follows this list).
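
A simplified honeypot sketch: a link is hidden from human visitors with CSS, and any click on it is reported as a bot signal (the /bot-signal endpoint is a placeholder):

JavaScript
// Create a link that real visitors never see
const trap = document.createElement('a');
trap.href = '/honeypot';
trap.textContent = 'Special offer';
trap.style.display = 'none'; // invisible to humans, but present in the DOM
document.body.appendChild(trap);

trap.addEventListener('click', () => {
  // Only a bot blindly following or clicking hidden elements ends up here
  navigator.sendBeacon('/bot-signal', 'honeypot-click');
});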

4. IP and geolocation detection

  • Uses a list of known malicious IP addresses to block requests from those addresses (a minimal sketch follows this list).
  • Restricts access based on the geographic location of the IP address, for example, only allowing requests from specific countries or regions.
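
On the server side, this can be as simple as checking the request IP against a blocklist. A minimal sketch, with placeholder addresses and no proxy or IPv6 handling:

JavaScript
// Hypothetical blocklist of known malicious IP addresses
const blockedIPs = new Set(['203.0.113.5', '198.51.100.23']);

function shouldBlock(req) {
  // req is a Node.js http.IncomingMessage
  const ip = req.socket.remoteAddress;
  return blockedIPs.has(ip);
}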

5. Traffic analysis

  • Monitors and analyzes website traffic patterns to identify abnormal traffic spikes and distributions.
  • Analyzes the duration and interaction patterns of user sessions. Bot sessions are usually short and highly patterned.

6. Machine learning

Use machine learning algorithms to analyze and identify behavioral differences between normal users and bots. Machine learning models can continuously learn and adapt to new bot behaviors.

How Do Websites Detect Puppeteer?

Websites can check for specific JavaScript variables on a page that are commonly associated with the use of Puppeteer.

For example, they may look for variable names that contain "puppeteer" or other relevant identifiers.

JavaScript
for (let key in window) {
    if (key.includes('puppeteer') || key.includes('webdriver')) {
        // Detected Puppeteer
    }
}

Puppeteer also modifies browser behavior to automate tasks. As a result, sites may check for the presence and value of a property like navigator.webdriver, or other automation indicator flags to determine if an automation tool is controlling the browser.

This property is typically set to true in Puppeteer.
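
For example, a minimal client-side check a site might run on page load:

JavaScript
// Flag visitors whose browser reports the automation flag
if (navigator.webdriver) {
  // Treat the visitor as a likely bot: log it or serve a challenge
  console.log('Automation flag detected');
}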

The Easiest Way to Bypass Bot Detection - Nstbrowser

Want to bypass bot detection quickly? Start using Nstbrowser for free now!
Nstbrowser offers:

  • Intelligent IP rotation
  • Premium Proxies
  • CAPTCHA solver

Nstbrowser not only uses real browser fingerprints for web access but also simulates the behavior and habits of real users, making it unrecognizable to anti-bot systems.

In addition, to simplify web scraping and automation, Nstbrowser is equipped with powerful website unblocker technology to provide a seamless web access experience.

8 Best Methods to Avoid Bot Detection with Puppeteer

As mentioned above, bot detection has become a major problem for web crawler programs. But don't worry! We can still solve it easily.

Besides using Nstbrowser, here are some techniques you can use to avoid bot detection with Puppeteer:

Method 1. IP/proxy rotation

The main way most bot detectors work is by examining IP addresses. Web servers can derive patterns from IP addresses by keeping a log of each request.

They use Web Application Firewalls (WAFs) to track and block IP address activity and blacklist suspect IPs. Repeated and programmed requests to the server can damage IP reputation and lead to permanent blocking.

To avoid this kind of detection, you can set up a proxy in Puppeteer:

JavaScript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    args: [
      '--proxy-server=http://your_proxy_ip:your_proxy_port',
      // Add any other Chrome flags you need
    ],
  });
  const page = await browser.newPage();

  // Now Puppeteer will use the proxy specified above
  await page.goto('https://example.com');
  
  // Continue with your automation tasks

  await browser.close();
})();
  • The --proxy-server=http://your_proxy_ip:your_proxy_port argument specifies the address and port of the proxy server.
  • You can add additional Chrome flags (args) as needed.

Make sure to replace your_proxy_ip and your_proxy_port with the IP address and port number of the actual proxy server you are using.
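
The snippet above routes all traffic through a single proxy. To actually rotate IPs, a common pattern is to pick a different proxy from a pool for each browser session. A minimal sketch, assuming you maintain your own list of proxy URLs (the addresses below are placeholders):

JavaScript
const puppeteer = require('puppeteer');

// Placeholder pool of proxy servers: replace with your own
const proxies = [
  'http://proxy1.example.com:8000',
  'http://proxy2.example.com:8000',
  'http://proxy3.example.com:8000',
];

(async () => {
  // Pick a different proxy for each browser session
  const proxy = proxies[Math.floor(Math.random() * proxies.length)];

  const browser = await puppeteer.launch({
    args: [`--proxy-server=${proxy}`],
  });
  const page = await browser.newPage();

  // If your proxy requires credentials, authenticate before navigating
  // await page.authenticate({ username: 'user', password: 'pass' });

  await page.goto('https://example.com');
  await browser.close();
})();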

Method 2. Rotating HTTP header information and User-Agent

Websites typically check the User-Agent of a request to determine which browser and operating system the request is coming from.

By default, Puppeteer uses a fixed User-Agent, which makes it easy to detect. Randomizing the User-Agent makes requests look like they come from different real users.

In addition, anti-bot systems also check other HTTP headers to identify bots, such as Accept-Language, Accept-Encoding, and Cache-Control.

Default HTTP headers can also expose the use of automation tools. Setting common headers and randomizing them helps your requests look authentic.

JavaScript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const randomUseragent = require('random-useragent'); // Random User-Agent Library

puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Define common HTTP headers
  const commonHeaders = {
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Cache-Control': 'no-cache',
    'Upgrade-Insecure-Requests': '1',
  };

  // Randomize User-Agent and HTTP headers
  const setRandomHeaders = async (page) => {
    const userAgent = randomUseragent.getRandom(); // Get random User-Agent
    await page.setUserAgent(userAgent);

    await page.setExtraHTTPHeaders(commonHeaders);
  };

  await setRandomHeaders(page);

  await page.evaluateOnNewDocument(() => {
    Object.defineProperty(navigator, 'webdriver', {
      get: () => false,
    });
  });

  await page.goto('https://example.com', {
    waitUntil: 'networkidle2',
  });

  // Continue with your automation tasks

  await browser.close();
})();

Method 3. Modifying the navigator.webdriver property

By default, Puppeteer-controlled browsers report the navigator.webdriver property as true, which exposes the presence of an automation tool. Disabling or modifying this property reduces the chances of being detected.

JavaScript
await page.evaluateOnNewDocument(() => {
  Object.defineProperty(navigator, 'webdriver', { get: () => false });
});

Method 4. Using the stealth plugin

Using the puppeteer-extra-plugin-stealth plugin can help Puppeteer avoid being detected as a bot.

This plugin modifies some of the browser's default behaviors and characteristics so that it looks like a real user is browsing.

First, you need to install the puppeteer-extra and puppeteer-extra-plugin-stealth plugins:

Bash
npm install puppeteer-extra puppeteer-extra-plugin-stealth

Next, you can use these plugins in your code to launch Puppeteer:

JavaScript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());

const browser = await puppeteer.launch({ headless: true });

Method 5. Using Cookies

Repeated logins are always required if you want to scrape data from social media platforms or other sites that require authentication.

Repeated authentication requests trigger alerts, and the account may be blocked or challenged with a CAPTCHA or JavaScript verification.

We can avoid this by using cookies. After logging in once, we can save the session cookies and reuse them later, as sketched below.
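
A minimal sketch of this pattern, assuming the login URL, dashboard URL, and file path are placeholders you adapt to your target site:

JavaScript
const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // First run: log in (manually or via script), then save the session cookies
  await page.goto('https://example.com/login');
  // ... perform your login steps here ...
  const cookies = await page.cookies();
  fs.writeFileSync('cookies.json', JSON.stringify(cookies, null, 2));

  // Later runs: restore the cookies instead of logging in again
  const saved = JSON.parse(fs.readFileSync('cookies.json', 'utf8'));
  await page.setCookie(...saved);
  await page.goto('https://example.com/dashboard');

  await browser.close();
})();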

Method 6. Using CAPTCHA Resolution Service

During web scraping, you will inevitably run into CAPTCHAs. This is where a CAPTCHA solving service comes in.

Typically, these services use real users to resolve CAPTCHA, thereby reducing the likelihood of being detected as a bot.

This helps bypass bot detection and can also reduce the overall cost of running a bot.
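
Every solver service has its own API, so the exact calls differ. The sketch below assumes a hypothetical solveCaptcha() helper that forwards the page's sitekey to your chosen service and returns a response token, which is then injected into the reCAPTCHA response field:

JavaScript
// solveCaptcha() is a hypothetical helper wrapping whichever solver service you use;
// it is assumed to take a sitekey and page URL and return a reCAPTCHA token.
async function solveCaptcha(siteKey, pageUrl) {
  // Call your solver service's HTTP API here and return the token it produces
  throw new Error('Implement with your CAPTCHA service of choice');
}

async function bypassRecaptcha(page) {
  // Read the sitekey from the reCAPTCHA widget on the page
  const siteKey = await page.$eval('.g-recaptcha', el => el.getAttribute('data-sitekey'));
  const token = await solveCaptcha(siteKey, page.url());

  // Inject the solved token into the hidden response field
  await page.evaluate((t) => {
    document.getElementById('g-recaptcha-response').value = t;
  }, token);
}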

Nstbrowser easily avoids bot detection with a powerful CAPTCHA Solver.
Start to Use for Free Now!

Method 7. Delayed input and randomization

Real users can't make 500 requests in a minute!

And real users don't browse in fixed, perfectly repeatable patterns either!

So, to avoid being easily detected by anti-bot systems, we need to add input delays and some randomization to the automation script when using Puppeteer. This mimics a real user and reduces the risk of detection. A random-delay helper is also sketched after the examples below.

  • Simulate the speed of human input instead of typing everything immediately:
JavaScript
await page.type('input[name=username]', 'myUsername', { delay: 100 });
await page.type('input[name=password]', 'myPassword', { delay: 100 });
  • Randomize mouse movements, clicks, and scrolling actions:
JavaScript
const x = 100 + Math.floor(Math.random() * 300);
const y = 100 + Math.floor(Math.random() * 300);
await page.mouse.move(x, y);
await page.mouse.click(x, y);
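
You can also put a random pause between actions so that requests are not evenly spaced. A small helper sketch (the delay range and selectors are just examples):

JavaScript
// Pause for a random interval (here 500–2000 ms) between actions
const randomDelay = (min = 500, max = 2000) =>
  new Promise(resolve => setTimeout(resolve, min + Math.random() * (max - min)));

await page.click('a.next-page');
await randomDelay();
await page.click('a.product-link');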

Method 8. Using browser extensions

When running automation tasks with Puppeteer, you can sometimes use browser extensions to help bypass some bot detection checks.

These extensions can modify the behavior of the browser to make it appear more like it is being operated by a real user.

Load local extensions:

  • Download the browser extensions you want to use (such as those for Chrome) locally.
  • Load the extension by specifying the args parameter when starting Puppeteer:
JavaScript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false, // non-headless mode
    args: [
      `--disable-extensions-except=/path/to/extension/`, // Load extensions with specified paths
      `--load-extension=/path/to/extension/`
    ]
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // Continue executing your code
})();
  • This will allow you to load and use specific extensions in Puppeteer-controlled browser instances, which can sometimes help bypass bot detection.

Change the default Chrome extension path

Puppeteer launches Chrome with a clean, empty profile by default. You can point it at a custom user data directory by setting userDataDir and preload the required extensions in that profile.
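
A minimal sketch, assuming /path/to/profile is a Chrome user data directory in which the extensions you need are already installed:

JavaScript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    // Reuse a profile that already has the desired extensions installed
    userDataDir: '/path/to/profile',
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();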

Epilogue

In this article, we discussed:

  • why websites use anti-bot systems,
  • how those systems work, and
  • the 8 best methods to avoid bot detection with Puppeteer.

Nstbrowser's RPA solution is one of the best options available for avoiding bot detection, and you can configure and use it completely free of charge.
