How does Cloudflare detect bots? Useful Cloudflare Bypass Service
Apr 10, 2024

Cloudflare is a web performance and security company. On the security side, they offer customers a Web Application Firewall (WAF). A WAF can defend applications against several security threats, such as cross-site scripting (XSS), credential stuffing, and DDoS attacks.

One of the core systems included in their WAF is Cloudflare's Bot Manager. As a bot protection solution, its main goal is to mitigate attacks from malicious bots without impacting real users.

Cloudflare acknowledges the importance of certain bots. For example, no site wants to deliberately block Google or other search engines from crawling its webpage. To account for this, Cloudflare maintains an allowlist for known good bots.

Unfortunately for web-scraping enthusiasts like you and me, they also assume all non-whitelisted bot traffic is malicious. So, regardless of your intent, there's a good chance your bot gets denied access to a Cloudflare-protected web page.

If you've tried to scrape a Cloudflare-protected site before, you may have run into some of the following Bot Manager-related errors:

  • Error 1010: The owner of this website has banned your access based on your browser's signature
  • Error 1012: Access Denied
  • Error 1015: You are being rate limited
  • Error 1020: Access Denied

Typically, these errors are accompanied by a 403 Forbidden HTTP response status code from Cloudflare.
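To make the error list concrete, here is a minimal sketch of how a scraper might recognize one of these block pages. The function name `classify_block` and the regular expression are assumptions for illustration; real Cloudflare error pages vary in wording, so treat the pattern as a starting point rather than a reliable detector.

```python
import re
from typing import Optional

# Error codes and meanings taken from the list above.
CLOUDFLARE_ERRORS = {
    1010: "Browser signature banned",
    1012: "Access denied",
    1015: "Rate limited",
    1020: "Access denied",
}

# Assumption: the block page mentions the code as "Error 1020" or
# "error code: 1020". Actual page wording may differ.
ERROR_PATTERN = re.compile(r"error(?: code)?:?\s*(10\d{2})", re.IGNORECASE)


def classify_block(status_code: int, body: str) -> Optional[str]:
    """Return a short description if the body names a known error code."""
    match = ERROR_PATTERN.search(body)
    if match is None:
        return None
    code = int(match.group(1))
    label = CLOUDFLARE_ERRORS.get(code)
    if label is None:
        return None
    return f"{code}: {label} (HTTP {status_code})"


print(classify_block(403, "error code: 1020"))
```

A scraper could call this on every non-200 response and back off or rotate its setup depending on which code comes back.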

How does Cloudflare detect bots?

The bot detection methods used by Cloudflare can generally be classified into two categories: passive and active. Passive bot detection techniques consist of fingerprinting checks performed on the backend, while active detection techniques rely on checks performed on the client side. Let's dive into a few examples from each category together!

Cloudflare passive bot detection techniques

Here's a non-exhaustive list of some passive bot detection techniques Cloudflare employs:

Detecting botnets

Cloudflare maintains a catalog of devices, IP addresses, and behavioral patterns known to be associated with malicious bot networks. Any device suspected to belong to one of these networks is either automatically blocked or faced with additional client-side challenges to solve.

IP address reputation

A user's IP address reputation (also known as risk score or fraud score) is based on factors such as geolocation, ISP, and reputation history. For example, IPs belonging to a data center or known VPN provider will have a worse reputation than a residential IP address. A site may also choose to limit access to a site from regions outside of the area they serve since traffic from an actual customer should never come from there.
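As a rough illustration of how these factors could combine, the sketch below computes a toy risk score. Everything in it is hypothetical: the `IpInfo` fields, the weights, and the 0 to 100 scale are invented for the example and are not Cloudflare's actual model.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class IpInfo:
    """Hypothetical facts known about an IP address."""
    is_datacenter: bool
    is_known_vpn: bool
    country: str
    past_abuse_reports: int


def risk_score(ip: IpInfo, served_countries: Set[str]) -> int:
    """Toy score from 0 (clean) to 100 (very risky); weights are made up."""
    score = 0
    if ip.is_datacenter:
        score += 40  # data center IPs rank worse than residential ones
    if ip.is_known_vpn:
        score += 30
    if ip.country not in served_countries:
        score += 20  # traffic from outside the region the site serves
    score += min(ip.past_abuse_reports, 5) * 2  # reputation history, capped
    return min(score, 100)


residential = IpInfo(False, False, "US", 0)
datacenter_vpn = IpInfo(True, True, "ZZ", 10)
print(risk_score(residential, {"US", "CA"}))     # 0
print(risk_score(datacenter_vpn, {"US", "CA"}))  # 100
```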

HTTP request headers

Cloudflare uses HTTP request headers to determine if you're a robot. If you have a non-browser user agent, such as python-requests/2.22.0, your scraper can easily be picked out as a bot. Cloudflare can also block your bot if it sends a request that is missing headers a browser would normally send, or if its headers don't match its user-agent. For example, a request with a Firefox user-agent that includes a sec-ch-ua-full-version-list header is suspicious, because Firefox doesn't send that header.
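The three header checks described here can be sketched as a small function. The token list, the set of "expected" headers, and the name `header_red_flags` are assumptions chosen for illustration; a real detector would use far richer rules.

```python
from typing import Dict, List

# Assumption: substrings that mark a user agent as a non-browser client.
NON_BROWSER_TOKENS = ("python-requests", "curl", "go-http-client")

# Assumption: headers a mainstream browser sends on ordinary page loads.
EXPECTED_BROWSER_HEADERS = ("accept", "accept-language", "accept-encoding")


def header_red_flags(headers: Dict[str, str]) -> List[str]:
    """Return reasons why a request's headers look non-browser-like."""
    lowered = {name.lower(): value for name, value in headers.items()}
    user_agent = lowered.get("user-agent", "").lower()
    flags = []

    # 1. Non-browser (or missing) user agent.
    if not user_agent or any(tok in user_agent for tok in NON_BROWSER_TOKENS):
        flags.append("non-browser user agent")

    # 2. Headers a browser would normally send are missing.
    for name in EXPECTED_BROWSER_HEADERS:
        if name not in lowered:
            flags.append(f"missing header: {name}")

    # 3. Headers that contradict the user agent: Firefox does not send
    #    sec-ch-ua-* client hint headers.
    if "firefox" in user_agent and any(n.startswith("sec-ch-ua") for n in lowered):
        flags.append("client hint headers sent with a Firefox user agent")

    return flags


print(header_red_flags({"User-Agent": "python-requests/2.22.0"}))
```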

TLS fingerprinting

This technique enables Cloudflare's antibot to identify the client being used to send requests to a server.
Though there are multiple methods of fingerprinting TLS (such as JA3, JARM, and CYU), each implementation produces a fingerprint that stays the same across requests from a given client. TLS fingerprinting is helpful because a browser's TLS implementation tends to differ from that of other versions of the same browser, other browsers, and request-based libraries. For example, a Chrome browser on Windows (version 104) would have a different fingerprint than all of the following:

  • A Chrome browser on Windows (version 87)
  • A Firefox browser
  • A Chrome browser on an Android device
  • The Python HTTP requests library

The construction of a TLS fingerprint happens during the TLS Handshake. Cloudflare analyzes the fields provided in the 'client hello' message, such as cipher suites, extensions, and elliptic curves, to compute a fingerprint hash for a given client.
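As one concrete instance, the sketch below builds a JA3-style fingerprint from already-parsed client hello fields. JA3 joins the TLS version, cipher suites, extensions, elliptic curves, and point formats into a string and hashes it with MD5, ignoring GREASE placeholder values. The numeric inputs in the example are illustrative, and whether Cloudflare uses JA3 specifically or another scheme is not something this sketch claims.

```python
import hashlib
from typing import List


def is_grease(value: int) -> bool:
    """GREASE values look like 0x0a0a, 0x1a1a, ..., 0xfafa and are skipped."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_string(version: int, ciphers: List[int], extensions: List[int],
               curves: List[int], point_formats: List[int]) -> str:
    """Join the five client hello fields: ',' between fields, '-' within."""
    fields = [[version], ciphers, extensions, curves, point_formats]
    return ",".join(
        "-".join(str(v) for v in field if not is_grease(v)) for field in fields
    )


def ja3_hash(version: int, ciphers: List[int], extensions: List[int],
             curves: List[int], point_formats: List[int]) -> str:
    """MD5 hex digest of the JA3 string."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()


# Illustrative values only (771 is TLS 1.2; 0x0A0A is a GREASE cipher).
print(ja3_string(771, [0x0A0A, 4865, 4866], [0, 23, 65281], [29, 23], [0]))
# 771,4865-4866,0-23-65281,29-23,0
```

Because the order of ciphers and extensions is part of the string, two clients that support the same features but list them differently end up with different hashes.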

Next, that hash is looked up in a database of pre-collected fingerprints to determine the client the request came from. Suppose the client's hash matches an allowed fingerprint hash (i.e., a browser's fingerprint). In that case, Cloudflare will then compare the user-agent header from the client's request to the user-agent associated with the stored fingerprint hash.

If they match, the security system assumes that the request originated from a standard browser. On the contrary, a mismatch between a client's TLS fingerprint and its advertised user-agent indicates obvious use of custom botting software, resulting in the request being blocked.
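The lookup-and-compare step can be sketched like this. The fingerprint hashes in `KNOWN_FINGERPRINTS` are placeholders rather than real browser fingerprints, and the "unknown fingerprint" outcome is an assumption, since the handling of hashes that aren't in the database isn't described above.

```python
from typing import Dict

# Hypothetical database: TLS fingerprint hash -> browser family it belongs to.
KNOWN_FINGERPRINTS: Dict[str, str] = {
    "placeholder-hash-chrome": "chrome",
    "placeholder-hash-firefox": "firefox",
}


def browser_family(user_agent: str) -> str:
    """Very rough user-agent classification, for illustration only."""
    ua = user_agent.lower()
    if "firefox/" in ua:
        return "firefox"
    if "chrome/" in ua:
        return "chrome"
    return "unknown"


def verdict(tls_hash: str, user_agent: str) -> str:
    """Compare the fingerprint's stored browser with the advertised one."""
    expected = KNOWN_FINGERPRINTS.get(tls_hash)
    if expected is None:
        return "unknown fingerprint"  # assumption: handled by other checks
    if expected == browser_family(user_agent):
        return "allow"
    return "block: fingerprint/user-agent mismatch"


chrome_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/104.0.0.0 Safari/537.36"
print(verdict("placeholder-hash-chrome", chrome_ua))   # allow
print(verdict("placeholder-hash-firefox", chrome_ua))  # block: ...
```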

HTTP/2 fingerprinting

The HTTP/2 specification is the second major HTTP protocol version, published on May 14, 2015, as RFC 7540. The protocol is supported by all major browsers.

The main goal of HTTP/2 was to improve the performance of websites and web applications by introducing header field compression and allowing concurrent requests and responses on the same TCP connection. To accomplish this, HTTP/1.1's foundation was expanded with new parameters and values. These new internals are what the HTTP/2 fingerprint is based on.

The binary framing layer is a new addition to HTTP/2 and is the central focus of an HTTP/2 fingerprint.

If you're interested in a more in-depth analysis of HTTP/2 fingerprinting, you should read Akamai's proposed method for fingerprinting HTTP2 clients here: Passive Fingerprinting of HTTP/2 Clients. But for now, here's a summary:

Three main components form an HTTP/2 fingerprint:

  • Frames: SETTINGS_HEADER_TABLE_SIZE, SETTINGS_ENABLE_PUSH, SETTINGS_MAX_CONCURRENT_STREAMS, SETTINGS_INITIAL_WINDOW_SIZE, SETTINGS_MAX_FRAME_SIZE, SETTINGS_MAX_HEADER_LIST_SIZE, WINDOW_UPDATE
  • Stream Priority Information: StreamID:Exclusivity_Bit:Dependant_StreamID:Weight
  • Pseudo Header Fields Order: The order of the :method, :authority, :scheme, and :path headers.
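Here is a minimal sketch of how those three components could be serialized into a single fingerprint string, loosely following the `settings|window_update|priorities|pseudo_header_order` layout from Akamai's paper. The input values are illustrative rather than captured from a real browser, and the function name `http2_fingerprint` is an assumption.

```python
from typing import List, Optional, Tuple

# SETTINGS identifiers as defined by the HTTP/2 specification.
SETTINGS_IDS = {
    "SETTINGS_HEADER_TABLE_SIZE": 1,
    "SETTINGS_ENABLE_PUSH": 2,
    "SETTINGS_MAX_CONCURRENT_STREAMS": 3,
    "SETTINGS_INITIAL_WINDOW_SIZE": 4,
    "SETTINGS_MAX_FRAME_SIZE": 5,
    "SETTINGS_MAX_HEADER_LIST_SIZE": 6,
}


def http2_fingerprint(
    settings: List[Tuple[str, int]],
    window_update: Optional[int],
    priorities: List[Tuple[int, int, int, int]],
    pseudo_headers: List[str],
) -> str:
    """Serialize the frames, priorities, and pseudo-header order."""
    # Settings keep the order in which the client sent them.
    s = ";".join(f"{SETTINGS_IDS[name]}:{value}" for name, value in settings)
    # "00" stands for "no WINDOW_UPDATE frame was sent".
    wu = str(window_update) if window_update else "00"
    # StreamID:Exclusivity_Bit:Dependant_StreamID:Weight, or "0" if none.
    p = ",".join(f"{sid}:{excl}:{dep}:{weight}"
                 for sid, excl, dep, weight in priorities) or "0"
    # First letter of each pseudo-header, in the order sent.
    ps = ",".join(name.lstrip(":")[0] for name in pseudo_headers)
    return "|".join([s, wu, p, ps])


print(http2_fingerprint(
    [("SETTINGS_HEADER_TABLE_SIZE", 65536),
     ("SETTINGS_MAX_CONCURRENT_STREAMS", 1000),
     ("SETTINGS_INITIAL_WINDOW_SIZE", 6291456),
     ("SETTINGS_MAX_HEADER_LIST_SIZE", 262144)],
    15663105,
    [],
    [":method", ":authority", ":scheme", ":path"],
))
# 1:65536;3:1000;4:6291456;6:262144|15663105|0|m,a,s,p
```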

If you're curious, you can test a live HTTP/2 fingerprinting demo by clicking here.

Like TLS fingerprinting, each request client will have a static HTTP/2 fingerprint. To determine a request's legitimacy, Cloudflare always verifies that the fingerprint and user-agent pair from the request matches a whitelisted one stored in their database.

HTTP/2 fingerprinting and TLS fingerprinting go hand in hand. Out of all the passive bot detection techniques Cloudflare uses, these two are the most technically challenging to control in a request-based bot. However, they're also the most important. So, you want to ensure you do them right or risk getting blocked!

Alright! By now, you should have a good understanding of how Cloudflare detects bots passively. But, remember: that's only half of the story. Now, let's take a look at how they do it actively!

Cloudflare active bot detection techniques

When you visit a Cloudflare-protected website, many checks are constantly running on the client-side (i.e., in your local browser) to determine if you're a robot. Here's a list of some methods they use (once again, non-exhaustive):

CAPTCHAs

In the past, CAPTCHAs were the go-to method for detecting bots. However, it's well-known that they harm the end user's experience. Whether or not Cloudflare serves the user a captcha is dependent on several factors, such as:

  • The site configuration. A website administrator may choose to enable CAPTCHAs all the time, sometimes, or never at all.
  • Risk Level. Cloudflare may choose to serve a CAPTCHA only if the traffic is suspicious. For example, a CAPTCHA may be shown if a user browses a site using the Tor client, but not if the user runs a standard web browser like Google Chrome. For these cases, a Cloudflare CAPTCHA bypass is possible and we'll see how below.

Previously, Cloudflare used reCAPTCHA as their primary captcha provider. In 2020, they migrated to hCaptcha, and in 2022 they introduced their own challenge, Turnstile, as a replacement for traditional CAPTCHAs.

Canvas fingerprinting

Canvas fingerprinting allows a system to identify the device class of a web client. A device class refers to the combination of browser, operating system, and graphics hardware of the system used to access the webpage.

Canvas is an HTML5 API used to draw graphics and animations on a web page using JavaScript. To construct a canvas fingerprint, a webpage queries your browser's canvas API to render an image. That image is then hashed to produce a fingerprint.
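The hashing step can be illustrated without a browser. In the sketch below, the two byte strings stand in for the RGBA pixel data that two devices produce from the same drawing instructions. Both the pixel values and the choice of SHA-256 are assumptions for the example; the hash function used in practice isn't specified here.

```python
import hashlib


def canvas_fingerprint(pixels: bytes) -> str:
    """Hash raw pixel data into a fixed-length fingerprint."""
    return hashlib.sha256(pixels).hexdigest()


# Hypothetical 2x2 RGBA renderings of the same drawing instructions.
# Device B's anti-aliasing shifts a single colour channel by one.
device_a = bytes([255, 0, 0, 255] * 4)
device_b = bytes([255, 0, 0, 255] * 3 + [254, 0, 0, 255])

print(canvas_fingerprint(device_a) == canvas_fingerprint(device_a))  # True
print(canvas_fingerprint(device_a) == canvas_fingerprint(device_b))  # False
```

The same device renders the same pixels every time, so its fingerprint is stable, while even a one-value rendering difference yields a different hash.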

This technique relies on treating a system's graphics rendering stack as a physically unclonable function. That might sound complicated, so let me explain it.

A canvas fingerprint depends on multiple layers of the computing system, such as:

  • Hardware: GPU
  • Low-level software: GPU driver, operating system (fonts, anti-aliasing/sub-pixel rendering algorithms)
  • High-level software: web browser (image processing engine)

Because a variation in any of these categories will produce a unique fingerprint, this technique accurately differentiates between device classes.

I want to clarify this: a canvas fingerprint doesn't contain enough information to sufficiently track and identify unique individuals or bots. Instead, its main purpose is to distinguish between device classes accurately.

In the context of bot detection, this is useful because bots tend to lie about their underlying technology (via their user-agent header). Cloudflare has a large dataset of legitimate canvas fingerprint and user-agent pairs. Using machine learning, they can detect device property spoofing (e.g., user-agent, operating system, or GPU) by looking for a mismatch between your canvas fingerprint and the expected one.

Cloudflare uses a specific canvas fingerprinting method, Google's Picasso Fingerprinting.

If you'd like to see canvas fingerprinting in action, check out BrowserLeaks' live demo.

Event tracking

Cloudflare adds event listeners to webpages. These listen for user actions, such as mouse movements, mouse clicks, or key presses. Most of the time, a real user will need to use their mouse or keyboard to browse. If Cloudflare sees a consistent lack of mouse or keyboard usage, they can assume the user is a bot.
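A toy version of this heuristic is sketched below. The event names, the three-page-view threshold, and the function name `looks_automated` are all assumptions for illustration; the real signals and thresholds aren't public.

```python
from typing import List

# Events that suggest a human is using a mouse or keyboard.
INPUT_EVENTS = {"mousemove", "click", "keydown"}


def looks_automated(page_views: List[List[str]], min_views: int = 3) -> bool:
    """Flag a session when no page view contains any input event.

    page_views holds one list of recorded event names per page view.
    Sessions shorter than min_views are not judged, since a "consistent"
    lack of input needs more than one observation.
    """
    if len(page_views) < min_views:
        return False
    return all(
        not any(event in INPUT_EVENTS for event in events)
        for events in page_views
    )


print(looks_automated([["load"], ["load", "scroll"], ["load"]]))  # True
print(looks_automated([["load", "mousemove"], ["load"], ["load"]]))  # False
```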

Environment API querying

This is a very broad category. A browser has hundreds of Web APIs that can be used for bot detection. I'll do my best to split them up into 4 categories:

  1. Browser-specific APIs. These APIs exist in one browser but may not exist in another. For example, window.chrome is a property that only exists in a Chrome browser. If the data collected from your browser indicates that you're using Chrome, but your request carries a Firefox user agent, they'll know something is up.
  2. Timestamp APIs. Cloudflare makes use of timestamp APIs, such as Date.now() or window.performance.timing.navigationStart to keep track of a user's speed metrics. A user will be blocked if timestamps don't appear like ordinary human browsing activity. Some examples include: browsing inhumanly quickly or mismatching timestamps (such as a navigationStart timestamp from before the page was loaded).
  3. Automated Browser Detection. Cloudflare queries the browser for properties that only exist in automated web browser environments. For example, the existence of the window.document.__selenium_unwrapped or window.callPhantom property indicates the usage of Selenium and PhantomJS, respectively. For obvious reasons, you're getting blocked if this is detected.
  4. Sandboxing Detection. For our purposes, sandboxing refers to an attempt at emulating a browser in a non-browser environment. Cloudflare has checks to stop people from trying to solve its challenges with emulated browser environments, such as in NodeJS using JSDOM. For example, the script may look for the process object, which only exists in NodeJS. They also can detect if functions have been modified by using Function.prototype.toString.call(functionName) on the function in question.
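To tie the four categories together, the sketch below models the properties a client-side script might report back as a plain dictionary and checks them on the server side. The dictionary keys, the function name `environment_red_flags`, and the specific timestamp rule (a navigation start that lies in the future relative to Date.now()) are assumptions made for the example.

```python
from typing import Any, Dict, List

# Properties named above that only exist in automated or non-browser setups.
AUTOMATION_PROPERTIES = (
    ("window.document.__selenium_unwrapped", "Selenium"),
    ("window.callPhantom", "PhantomJS"),
)


def environment_red_flags(env: Dict[str, Any], user_agent: str) -> List[str]:
    """Return reasons why a reported browser environment looks suspicious."""
    flags = []
    ua = user_agent.lower()

    # 1. Browser-specific API that contradicts the user agent.
    if env.get("window.chrome") and "firefox/" in ua:
        flags.append("window.chrome present with a Firefox user agent")

    # 2. Inconsistent timestamps (assumed rule: navigation cannot start
    #    after the current time reported by the same browser).
    nav_start = env.get("navigationStart")
    now = env.get("Date.now")
    if nav_start is not None and now is not None and nav_start > now:
        flags.append("navigationStart is later than Date.now()")

    # 3. Properties left behind by browser automation tools.
    for prop, tool in AUTOMATION_PROPERTIES:
        if env.get(prop):
            flags.append(f"{tool} property found: {prop}")

    # 4. Signs of an emulated browser, such as the NodeJS process object.
    if env.get("process"):
        flags.append("NodeJS process object found")

    return flags


firefox_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0"
print(environment_red_flags({"window.chrome": True}, firefox_ua))
```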

Nstbrowser

Nstbrowser is a Cloudflare bypass API: an automated solution for Cloudflare's Turnstile and CAPTCHA challenges. Stop worrying about the intricacies of detection techniques, dynamic obfuscation, challenge solving, or updates. Offering both API and proxy modes, Nstbrowser can be seamlessly integrated into any of your scraping projects. Focus on your data scraping vision, and let Nstbrowser handle the rest.
You can check here to try the Cloudflare bypass service: https://www.nstbrowser.io/cloudflare

Conclusion

In conclusion, Cloudflare's bot detection mechanisms represent a sophisticated and multifaceted approach to safeguarding websites against malicious bot activity. By combining passive and active detection techniques, Cloudflare is able to effectively differentiate between human users and automated bots, thereby preserving the integrity and security of online platforms. However, the cat-and-mouse game between bot operators and security providers continues, with each side continuously innovating and adapting their strategies. As bot technology evolves, so too must the defenses employed by organizations like Cloudflare to stay one step ahead of malicious actors.
