Amazon is a rich source of information about products, sellers, reviews, ratings, special offers, news, and more. Whether you are a seller doing market research or an individual collecting data, a high-quality, convenient, and fast tool helps you crawl that information accurately.
Because Amazon gathers products, reviews, ratings, exclusive offers, and news in one place, scraping it saves a great deal of time and manual work. As a business, using an Amazon product scraper brings you at least the following 4 significant benefits:
Headless browsers excel at automated work, so in this tutorial we will use Nstbrowser's headless browser service, Browserless, to crawl Amazon product information.
When crawling Amazon product data, we constantly run into challenges such as bot detection, CAPTCHA solving, and IP blocking. Browserless helps you avoid these headaches.
Nstbrowser's Browserless provides realistic, unique browser fingerprints for every session. In addition, our subscription plan includes full CAPTCHA bypass so your access stays uninterrupted. Join our Discord referral program to share $1,500 in cash now!
Without further ado, let's now officially start using Browserless for data crawling!
Before we start, we need to connect to the Browserless service. Browserless handles complex web scraping and large-scale automation tasks as a fully managed cloud deployment.
It takes a browser-centric approach, offers powerful headless deployment capabilities, and delivers higher performance and reliability. For more information about Browserless, refer to our documentation.
To get the API key, go to the Browserless menu page of the Nstbrowser client, or click here to access it directly. Then install puppeteer-core with your preferred package manager:
# pnpm
pnpm i puppeteer-core
# yarn
yarn add puppeteer-core
# npm
npm i --save puppeteer-core
const apiKey = "your ApiKey"; // required
const config = {
  proxy: 'your proxy', // required; format: scheme://user:password@host:port e.g.: http://user:password@localhost:8080
  // platform: 'windows', // support: windows, mac, linux
  // kernel: 'chromium', // only support: chromium
  // kernelMilestone: '128', // support: 128
  // args: {
  //   "--proxy-bypass-list": "detect.nstbrowser.io"
  // }, // browser args
  // fingerprint: {
  //   userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.6613.85 Safari/537.36', // userAgent supported since v0.15.0
  // },
};
const query = new URLSearchParams({
  token: apiKey, // required
  config: JSON.stringify(config),
});
const browserlessWSEndpoint = `https://less.nstbrowser.io/connect?${query.toString()}`;
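A small note: instead of hardcoding the API key as above, you may prefer to read it from an environment variable. A minimal sketch; the variable name NSTBROWSER_API_KEY is only an example, not something Browserless requires:
// Read the key from an environment variable (example name), falling back to a placeholder
const apiKey = process.env.NSTBROWSER_API_KEY ?? "your ApiKey";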
Before crawling, we can try visiting https://www.amazon.com/. On a first visit there is a high probability that a CAPTCHA will appear:
That's not a problem, and we don't need to hunt for a CAPTCHA-solving tool. Simply visit the Amazon domain for your region (or the region of your proxy) and the CAPTCHA will not be triggered.
For example, let's visit https://www.amazon.co.uk/, Amazon's UK domain. The page loads normally, so we can enter the product keyword we want in the top search bar or go straight to a search URL such as:
https://www.amazon.co.uk/s?k=shirt
The value after /s?k= in the URL is the product keyword. Opening the URL above shows shirt-related products on Amazon. Now open the Developer Tools (F12), inspect the HTML structure of the page, and confirm the data we want to crawl later by hovering over the elements.
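Before writing any script, you can sanity-check these selectors directly in the DevTools console on the results page. A small sketch using the selectors from this tutorial (adjust them if Amazon's markup has changed):
// Count the result cards on the current search page
document.querySelectorAll('div[data-component-type="s-search-result"]').length;
// Peek at the first product's title text
document
  .querySelector('div[data-component-type="s-search-result"] .s-title-instructions-style > h2 > a > span')
  ?.textContent;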
First, add a few lines at the top of the script. The code below takes the first command-line argument as the Amazon product keyword, and the rest of the script will use this parameter to crawl:
const productName = process.argv.slice(2);
if (productName.length !== 1) {
  console.error('product name CLI arguments missing!');
  process.exit(2);
}
Next, we connect to Browserless, open a new page, and navigate to the search results:
import puppeteer from "puppeteer-core";

const browser = await puppeteer.connect({
  browserWSEndpoint: browserlessWSEndpoint,
  defaultViewport: null,
});
console.info('Connected!');

const page = await browser.newPage();
await page.goto(`https://www.amazon.co.uk/s?k=${productName}`);
// Add a screenshot to facilitate subsequent troubleshooting
await page.screenshot({ path: 'amazon_page.png' });
Now we use page.$$ to get a list of all products, loop through the list, and fetch the relevant data one item at a time. Then we collect this data into the productDataList array and print it:
// Get the container element of all search results
const productContainers = await page.$$('div[data-component-type="s-search-result"]');
const productDataList = [];

// Get various information about the product: title, rating, image link, price
for (const product of productContainers) {
  async function safeEval(selector, evalFn) {
    try {
      return await product.$eval(selector, evalFn);
    } catch (e) {
      return null;
    }
  }

  const title = await safeEval('.s-title-instructions-style > h2 > a > span', node => node.textContent);
  const rate = await safeEval('a > i.a-icon.a-icon-star-small > span', node => node.textContent);
  const img = await safeEval('span[data-component-type="s-product-image"] img', node => node.getAttribute('src'));
  const price = await safeEval('div[data-cy="price-recipe"] .a-offscreen', node => node.textContent);
  productDataList.push({ title, rate, img, price });
}

console.log('amazon_product_data_list :', productDataList);
await browser.close();
Running the script:
node amazon.mjs shirt
If successful, the following will be printed on the console:
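The exact values depend on the live search results; the shape of the output looks roughly like this (placeholder values, not real scraped data):
amazon_product_data_list : [
  {
    title: '<product title>',
    rate: '<x.x out of 5 stars>',
    img: '<image url>',
    price: '<£xx.xx>',
  },
  // ...one object per result card on the page
]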
Obviously, printing to the console is not enough if we want to analyze the data. Here is a simple example that quickly converts the JS object to a JSON file using the fs module:
import fs from 'fs';

function saveObjectToJson(obj, filename) {
  const jsonString = JSON.stringify(obj, null, 2);
  fs.writeFile(filename, jsonString, 'utf8', (err) => {
    err ? console.error(err) : console.log(`File saved successfully: ${filename}`);
  });
}

saveObjectToJson(productDataList, 'amazon_product_data.json');
Ok, let's take a look at our complete code:
import puppeteer from "puppeteer-core";
import fs from 'fs';

const productName = process.argv.slice(2);
if (productName.length !== 1) {
  console.error('product name CLI arguments missing!');
  process.exit(2);
}

const apiKey = "your ApiKey"; // required
const config = {
  proxy: 'your proxy', // required; format: scheme://user:password@host:port e.g.: http://user:password@localhost:8080
  // platform: 'windows', // support: windows, mac, linux
  // kernel: 'chromium', // only support: chromium
  // kernelMilestone: '128', // support: 128
  // args: {
  //   "--proxy-bypass-list": "detect.nstbrowser.io"
  // }, // browser args
  // fingerprint: {
  //   userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.6613.85 Safari/537.36', // userAgent supported since v0.15.0
  // },
};
const query = new URLSearchParams({
  token: apiKey, // required
  config: JSON.stringify(config),
});
const browserlessWSEndpoint = `https://less.nstbrowser.io/connect?${query.toString()}`;

const browser = await puppeteer.connect({
  browserWSEndpoint: browserlessWSEndpoint,
  defaultViewport: null,
});
console.info('Connected!');

const page = await browser.newPage();
await page.goto(`https://www.amazon.co.uk/s?k=${productName}`);
// Add a screenshot to facilitate subsequent troubleshooting
await page.screenshot({ path: 'amazon_page.png' });

// Get the container element of all search results
const productContainers = await page.$$('div[data-component-type="s-search-result"]');
const productDataList = [];

// Get various information about the product: title, rating, image link, price
for (const product of productContainers) {
  async function safeEval(selector, evalFn) {
    try {
      return await product.$eval(selector, evalFn);
    } catch (e) {
      console.log(`Error fetching ${selector}:`, e);
      return null;
    }
  }

  const title = await safeEval('.s-title-instructions-style > h2 > a > span', node => node.textContent);
  const rate = await safeEval('a > i.a-icon.a-icon-star-small > span', node => node.textContent);
  const img = await safeEval('span[data-component-type="s-product-image"] img', node => node.getAttribute('src'));
  const price = await safeEval('div[data-cy="price-recipe"] .a-offscreen', node => node.textContent);
  productDataList.push({ title, rate, img, price });
}

function saveObjectToJson(obj, filename) {
  const jsonString = JSON.stringify(obj, null, 2);
  fs.writeFile(filename, jsonString, 'utf8', (err) => {
    err ? console.error(err) : console.log(`File saved successfully: ${filename}`);
  });
}

saveObjectToJson(productDataList, 'amazon_product_data.json');
console.log('amazon_product_data_list :', productDataList);
await browser.close();
Now, after running the script, you will not only see the console output, but also find the amazon_product_data.json file written in the current directory.
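From here you can analyze the data however you like. As a minimal sketch, assuming the script above was run in the same directory, you can load the file back and do a quick sanity check:
import fs from 'fs';

// Load the scraped data back from disk
const products = JSON.parse(fs.readFileSync('amazon_product_data.json', 'utf8'));

// Quick sanity check: how many products were captured, and how many are missing a price?
console.log(`total products: ${products.length}`);
console.log(`missing price: ${products.filter((p) => p.price === null).length}`);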
You can view statistics for recent requests and remaining session time in the Browserless menu of the Nstbrowser client.
Crawling web data with RPA tools is another common approach to data collection: it greatly improves collection efficiency and reduces cost. Nstbrowser's RPA feature is designed to give you a smooth RPA experience and high work efficiency.
After reading this section, you will know how to build an Amazon product scraping workflow with Nstbrowser RPA.
First, you need to have an Nstbrowser account, and then log in to the Nstbrowser client, enter the workflow page of the RPA module, and click New Workflow.
Now, we can start configuring the RPA crawling workflow based on Amazon product search results.
Add a Goto Url node and configure the website URL to visit the target site. This time we will not query the product directly through the URL; instead, we will have RPA type into the search box on the homepage and then trigger the search. This not only makes us more familiar with how RPA works, but also further reduces the chance of triggering the site's risk controls.
Okay, after reaching the target website, we first need to locate the search input box. Here we use Chrome DevTools to find the HTML element.
Add an Input Content node, fill in the selector with the id of the input box we just located, and set the content to type. In this way, we have completed the action of typing into the search box.
Then add a Keyboard node to simulate pressing Enter and trigger the product search. Because the search jumps to a new page, we need to add a waiting action to make sure the results page has fully loaded. Nstbrowser RPA provides two waiting behaviors: Wait Time and Wait Request.
Wait Time: waits for a period of time. You can choose a fixed or a random duration depending on your situation.
Wait Request: waits for network requests to finish. Suitable when data is obtained through network requests.
Okay, now we can see the new product search page, and the next step is to crawl its contents.
Looking at the page, we can see that Amazon's search results are displayed as a card list, a very classic layout:
Similarly, open DevTools and locate each piece of data within a card:
Because each item in the card list is an HTML element, we use the Loop Element node to traverse all of the search results. Fill in the CSS selector of the product list in Selector and choose Element Object as the Data Type, which means the target element is fetched and saved as an element object in a variable. Set the variable name to product via Data Save Variable, and save the loop index as productIndex.
Next, we process each traversed element and extract the information we need from the product, starting with the title element. Use the Get Element Data node to fetch it and save it as the variable title.
Select Children as the Data Type, which means the child element of the target element is fetched and saved as an element object in the variable title. You need to fill in the selector of the child element; here that is naturally the CSS selector of the product title:
Then we use the same method to turn the remaining product information, ratings, image links, and prices, into RPA steps as well. The variables and their selectors are:
'title' .s-title-instructions-style > h2 > a > span
'rate' a > i.a-icon.a-icon-star-small > span
'img' span[data-component-type="s-product-image"] img
'price' div[data-cy="price-recipe"] .a-offscreen
However, the variables obtained above are still HTML elements. We need to process them further to extract their text and prepare the data for storage. Add the Get Element Data node again to output each variable as text and save it to the table variable for later storage. Select Text as the Data Type to get the innerText of the target element. (The figure below shows the processing of the variable title.)
Then we use the same method to convert the product's rating and price into the final text information.
The image link needs extra processing. Here we use the javascript node to get the src of the image for the currently traversed product. Note that the index variable productIndex saved by the Loop Element node needs to be injected into the script, and the result is saved as the variable imgSrc.
return document.querySelectorAll('[data-image-latency="s-product-image"]')[productIndex].getAttribute('src')
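If the injected index ever falls outside the image list (for example, when sponsored cards change the count), the one-liner above will throw. A slightly more defensive sketch, with the same selector assumptions, returns null instead:
// Collect all product images on the results page
const imgs = document.querySelectorAll('[data-image-latency="s-product-image"]');
// Guard against an out-of-range index before reading the src attribute
const img = imgs[productIndex];
return img ? img.getAttribute('src') : null;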
Finally, we use the Set Variable node to store the variable imgSrc in the table:
At this point, we have obtained all the data we want to collect, and it is time to save this data.
Nstbrowser RPA provides two nodes for saving data: Save To File and Save To Excel.
Save To File offers three file types to choose from: .txt, .csv, and .json.
Save To Excel can only save data to Excel files.
For easy viewing, we choose to save the collected data to Excel. Add the Save To Excel node, configure the path and name of the file to save, select the table content to be saved, and you are done!
Save the workflow we just configured. You can then run it directly on the current page, or go back to the previous page, create a new task, and click the Run button. At this point, we can start collecting Amazon's product data!
After the execution is completed, you can see the amazon-product-data.xlsx file generated on the desktop.
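If you want to post-process the exported spreadsheet programmatically, here is a minimal sketch using the community xlsx (SheetJS) package, which is not part of Nstbrowser and must be installed separately (npm i xlsx); adjust the file path if the workbook was saved to the desktop:
import fs from 'fs';
import * as XLSX from 'xlsx';

// Read the workbook exported by the RPA workflow
const workbook = XLSX.read(fs.readFileSync('amazon-product-data.xlsx'), { type: 'buffer' });
// Convert the first sheet into an array of row objects
const rows = XLSX.utils.sheet_to_json(workbook.Sheets[workbook.SheetNames[0]]);
console.log(`rows exported: ${rows.length}`);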
The easiest way to scrape Amazon products is to build your own Amazon product scraper with Browserless. This tutorial has walked you through both approaches: scraping with Browserless and Puppeteer, and building a no-code workflow with Nstbrowser RPA.
Particularly interested in web data? Check out our RPA marketplace: Nstbrowser has prepared 20 powerful RPA programs covering a wide range of scraping and automation needs.
If you have specific needs around Browserless, data scraping, or automation, feel free to contact us. We are ready to provide high-quality customized services.
Disclaimer: Any data and websites mentioned in this article are for demonstration purposes only. We firmly oppose illegal and infringing activities. If you have any questions or concerns, please contact us promptly.