Cloudflare is a web performance and security company. On the security side, they offer customers a Web Application Firewall (WAF). A WAF can defend applications against several security threats, such as cross-site scripting (XSS), credential stuffing, and DDoS attacks.
One of the core systems included in their WAF is Cloudflare's Bot Manager. As a bot protection solution, its main goal is to mitigate attacks from malicious bots without impacting real users.
Cloudflare acknowledges the importance of certain bots. For example, no site wants to deliberately block Google or other search engines from crawling its webpage. To account for this, Cloudflare maintains an allowlist for known good bots.
Unfortunately for web-scraping enthusiasts like you and me, they also assume all non-whitelisted bot traffic is malicious. So, regardless of your intent, there's a good chance your bot gets denied access to a Cloudflare-protected web page.
If you've tried to scrape a Cloudflare-protected site before, you may have run into a few of the following Bot-manager related errors:
Typically, these challenges are accompanied by a Cloudflare 403 Forbidden HTTP response status code.
The bot detection methods used by Cloudflare can generally be classified into two categories: passive and active. Passive bot detection techniques consist of fingerprinting checks performed on the backend, while active detection techniques rely on checks performed on the client side. Let's dive into a few examples from each category together!
Here's a non-exhaustive list of some passive bot detection techniques Cloudflare employs:
Cloudflare maintains a catalog of devices, IP addresses, and behavioral patterns known to be associated with malicious bot networks. Any device suspected to belong to one of these networks is either automatically blocked or faced with additional client-side challenges to solve.
A user's IP address reputation (also known as risk score or fraud score) is based on factors such as geolocation, ISP, and reputation history. For example, IPs belonging to a data center or known VPN provider will have a worse reputation than a residential IP address. A site may also choose to limit access to a site from regions outside of the area they serve since traffic from an actual customer should never come from there.
Cloudflare uses HTTP request headers to determine if you're a robot. If you have a non-browser user agent, such as python-requests/2.22.0
, your scraper can easily be picked out as a bot. Cloudflare can also block your bot if it sends a request that is missing headers that would otherwise be there in a browser. Or if you have mismatching headers based on your user-agent. For example, including a sec-ch-ua-full-version-list
: header for a Firefox user-agent.
This technique enables Cloudflare's antibot to identify the client being used to send requests to a server.
Though there are multiple methods of fingerprinting TLS (such as JA3, JARM, and CYU), each implementation produces a fingerprint that is static per request client. TLS fingerprinting is helpful because the TLS implementation of a browser tends to differ from that of other release versions, other browsers, and request-based libraries. For example, a Chrome browser on Windows (version 104) would have a different fingerprint than all of the following:
The construction of a TLS fingerprint happens during the TLS Handshake. Cloudflare analyzes the fields provided in the 'client hello' message, such as cipher suites, extensions, and elliptic curves, to compute a fingerprint hash for a given client.
Next, that hash is looked up in a database of pre-collected fingerprints to determine the client the request came from. Suppose the client's hash matches an allowed fingerprint hash (i.e., a browser's fingerprint). In that case, Cloudflare will then compare the user-agent header from the client's request to the user-agent associated with the stored fingerprint hash.
If they match, the security system assumes that the request originated from a standard browser. On the contrary, a mismatch between a client's TLS fingerprint and its advertised user-agent indicates obvious use of custom botting software, resulting in the request being blocked.
The HTTP/2 specification is the second major HTTP protocol version, published on May 14, 2015, as RFC 7540. The protocol is supported by all major browsers.
The main goal of HTTP/2 was to improve the performance of websites and web applications by introducing header field compression and allowing concurrent requests and responses on the same TCP connection. To accomplish this, HTTP/1.1's foundation was expanded with new parameters and values. These new internals are what the HTTP/2 fingerprint is based on.
The binary framing layer is a new addition to HTTP/2 and is the central focus of an HTTP/2 fingerprint.
If you're interested in a more in-depth analysis of HTTP/2 fingerprinting, you should read Akamai's proposed method for fingerprinting HTTP2 clients here: Passive Fingerprinting of HTTP/2 Clients. But for now, here's a summary:
Three main components form an HTTP/2 fingerprint:
SETTINGS_HEADER_TABLE_SIZE, SETTINGS_ENABLE_PUSH, SETTINGS_MAX_CONCURRENT_STREAMS, SETTINGS_INITIAL_WINDOW_SIZE, SETTINGS_MAX_FRAME_SIZE, SETTINGS_MAX_HEADER_LIST_SIZE, WINDOW_UPDATE
_Bit:Dependant_
StreamID:Weight:method
, :authority
, :scheme
, and :path
headers.If you're curious, you can test a live HTTP/2 fingerprinting demo by clicking here.
Like TLS fingerprinting, each request client will have a static HTTP/2 fingerprint. To determine a request's legitimacy, Cloudflare always verifies that the fingerprint and user-agent pair from the request matches a whitelisted one stored in their database.
HTTP/2 fingerprinting and TLS fingerprinting go hand in hand. Out of all the passive bot detection techniques Cloudflare uses, these two are the most technically challenging to control in a request-based bot. However, they're also the most important. So, you want to ensure you do them right or risk getting blocked!
Alright! By now, you should have a good understanding of how Cloudflare detects bots passively. But, remember: that's only half of the story. Now, let's take a look at how they do it actively!
When you visit a Cloudflare-protected website, many checks are constantly running on the client-side (i.e., in your local browser) to determine if you're a robot. Here's a list of some methods they use (once again, non-exhaustive):
In the past, CAPTCHAs were the go-to method for detecting bots. However, it's well-known that they harm the end user's experience. Whether or not Cloudflare serves the user a captcha is dependent on several factors, such as:
Previously, Cloudflare used reCAPTCHA as their primary captcha provider. But, since 2020, they've migrated to use hCaptcha exclusively. Below is an example of hCaptcha appearing on a Cloudflare-protected site:
Canvas fingerprinting allows a system to identify the device class of a web client. A device class refers to the combination of browser, operating system, and graphics hardware of the system used to access the webpage.
Canvas is an HTML5 API used to draw graphics and animations on a web page using JavaScript. To construct a canvas fingerprint, a webpage queries your browser's canvas API to render an image. That image is then hashed to produce a fingerprint.
This technique relies on taking a system's graphic rendering system as a physically unclonable function. That might sound complicated, so let me explain it.
A canvas fingerprint depends on multiple layers of the computing system, such as:
I want to clarify this: a canvas fingerprint doesn't contain enough information to sufficiently track and identify unique individuals or bots. Instead, its main purpose is to distinguish between device classes accurately.
In the context of bot detection, this is useful because bots tend to lie about their underlying technology (via their user-agent header). Cloudflare has a large dataset of legitimate canvas fingerprints + user agent pairs. Using machine learning, they can detect device property spoofing (ex. user-agent, operating system, or GPU) by looking for a mismatch between your canvas fingerprint and the expected one.
Cloudflare uses a specific canvas fingerprinting method, Google's Picasso Fingerprinting.
If you'd like to see canvas fingerprinting in action, check out Browserleak's live demo.
Cloudflare adds event listeners to webpages. These listen for user actions, such as mouse movements, mouse clicks, or key presses. Most of the time, a real user will need to use their mouse or keyboard to browse. If Cloudflare sees a consistent lack of mouse or keyboard usage, they can assume the user is a bot.
This is a very broad category. A browser has hundreds of Web APIs that can be used for bot detection. I'll do my best to split them up into 4 categories:
window.chrome
is a property that only exists in a Chrome browser. If the data you send Cloudflare indicates that you're using a Chrome browser but send it with a Firefox user agent, they'll know something is up.Date.now()
or window.performance.timing.navigationStart
to keep track of a user's speed metrics. A user will be blocked if timestamps don't appear like ordinary human browsing activity. Some examples include: browsing inhumanly quickly or mismatching timestamps (such as a navigationStart
timestamp from before the page was loaded).window.document.__selenium_unwrapped
or window.callPhantom
property indicates the usage of Selenium and PhantomJS, respectively. For obvious reasons, you're getting blocked if this is detected.process object
, which only exists in NodeJS. They also can detect if functions have been modified by using Function.prototype.toString.call(functionName)
on the function in question.Nstbrowser is a Cloudflare bypass API that is automated solution for Cloudflare's Turnstile & Captcha challenges. Stop worrying about the intricacies of detection techniques, dynamic obfuscation, challenge solving, or updates. Offering both API and proxy modes, Nstbrowser can be seamlessly integrated into any of your scraping projects. Focus on your data scraping vision, and let Nstbrowser handle the rest.
You can check here to try the Cloudflare bypass service: https://www.nstbrowser.io/cloudflare
In conclusion, Cloudflare's bot detection mechanisms represent a sophisticated and multifaceted approach to safeguarding websites against malicious bot activity. By combining passive and active detection techniques, Cloudflare is able to effectively differentiate between human users and automated bots, thereby preserving the integrity and security of online platforms. However, the cat-and-mouse game between bot operators and security providers continues, with each side continuously innovating and adapting their strategies. As bot technology evolves, so too must the defenses employed by organizations like Cloudflare to stay one step ahead of malicious actors.