Ubuntu 22.04 is a Debian-based Linux operating system that is a long-term support version (LTS), providing five years of official support and security updates.
It uses a modern GNOME desktop environment, optimizes performance and stability, and contains the latest packages and tools to support new hardware. In addition, Ubuntu 22.04 enhances security and provides stricter default security settings and privacy protection.
Yes!
Ubuntu 22.04 is codenamed Jammy Jellyfish. It was released on 21 April 2022 and is a long-term support release, supported for five years, until April 2027.
Browserless is a powerful browser solution. Based on cloud clusters, Browserless is designed to facilitate complex browser automation tasks, web crawling, and testing in a scalable and efficient manner. It is built on Nstbrowser's comprehensive fingerprint library and provides random fingerprint switching to meet user needs for data collection and automation tasks.
Do you have any wonderful ideas and doubts about web scraping and Browserless?
Let's see what other developers are sharing on Discord and Telegram!
Before using Browserless, you need to configure NodeJS in advance.
Ubuntu 22.04 comes with a Node.js package. Although the default version may not be the latest, it is sufficient to meet basic development needs. We use apt to install it, which is a simple way.
Step 1. Update the local package index.
Before installing Node.js, update the local package index:
sudo apt update
Step 2. Install Node.js.
Install Node.js
using the apt package manager:
sudo apt install nodejs -y
Step 3. Install the node package manager.
npm
is used to manage Node.js modules and packages
sudo apt install npm -y
Step 4. Verify that Node.js
and npm
are installed successfully.
After the installation is complete, you can check the installed Node.js version with the following command:
node -v
npm -v
If the output shows version information, it means the installation is successful.
mkdir nst-browserless && cd nst-browserless
npm init -y
code .
Currently, Browserless only supports Puppeteer and Playwright. In this tutorial, we choose to use a puppeteer. Next, it's time to install the dependencies:
npm i --save puppeteer-core
So far, we have completed the project construction.
Access Browserless
First, write a simple process to verify whether Browserless can be accessed normally:
import puppeteer from "puppeteer-core";
const token = "your token"; // required
const config = {
proxy: 'your proxy', // required
};
const query = new URLSearchParams({
token: token, // required
config: JSON.stringify(config),
});
const browserWSEndpoint = `https://less.nstbrowser.io/connect?${query.toString()}`;
const getBrowser = async () => puppeteer.connect({
browserWSEndpoint,
defaultViewport: null,
});
(async () => {
let browser = null;
await getBrowser()
.then(async (browser) => {
const page = await browser.newPage();
await page.goto("https://nstbrowser.io");
await page.screenshot({ path: "screenshot.png", fullPage: true });
await page.close();
await browser.close();
})
.catch((error) => {
console.log(error);
})
.finally(() => browser?.close());
})()
Being able to generate a screenshot shows that we can use Browserless normally:
Scraping images
Next, we go to the Pixels site to crawl some images:
Find the navigation menu link:
After analysis, we can see that the link we need to click is an a
tag under the li
element with the id menuTopArt
. Next, let's further analyze the element to be crawled:
It shows that the element where the image we want is located in the src
value of an img
element with the class name of searchEngineFeaturedProductImage
lazyloaded.
Modify the code we wrote earlier to complete the crawling work:
import puppeteer from "puppeteer-core";
const token = "your api token";
const config = {
proxy: 'your proxy', // required
};
const query = new URLSearchParams({
token: token, // required
config: JSON.stringify(config),
});
const browserWSEndpoint = `https://less.nstbrowser.io/connect?${query.toString()}`;
const getBrowser = async () => puppeteer.connect({
browserWSEndpoint,
defaultViewport: null,
});
(async () => {
let browser = null;
const pixelsWebsite = "https://pixels.com";
await getBrowser()
.then(async (browser) => {
const page = await browser.newPage();
await page.goto(pixelsWebsite);
await page.waitForSelector("#menuTopArt", { timeout: 30000 });
await page.click("#menuTopArt a"); // click 'Wall Art' menu
await page.waitForSelector(".searchEngineFeaturedProductImage", { timeout: 30000 });
const imageElements = await page.$$('.searchEngineFeaturedProductImage');
for (const imageElement of imageElements) {
const src = await page.evaluate(el => el.src, imageElement);
if (src.includes("Blank.jpg")) { // break if Blank.jgp comes
break
}
console.log(src);
// TODO: Add further processing for the image if needed
}
await page.close();
await browser.close();
})
.catch((error) => {
console.log(error);
})
.finally(() => browser?.close());
})()
Bravo! Here you can find the scraping result:
How wonderful Ubuntu is! In this blog, we learned that:
You can see that the whole process is very light and fast, and the process is also very simple.
Now start to study more ways to play Browserless on your own!