This guide focuses on web scraping with Java in 2024. It provides a step-by-step tutorial on how to extract data from websites using the Scrapeme site as an example. By utilizing Java and the jsoup library, you will learn how to scrape static web page resources and retrieve specific information such as product names, images, prices, and details.
This guide equips you with the necessary skills to scrape data from similar websites and serves as a foundation for more advanced web scraping techniques.
Get ready to harness the power of Java for web scraping in this modern era!
Since we will be using Java for the demo project in this article, make sure you have the following prerequisites in place before proceeding:
Note: The installation process for the environment is omitted.
JDK 21
# java -version
java version "21.0.2" 2024-01-16 LTS
Java(TM) SE Runtime Environment (build 21.0.2+13-LTS-58)
Java HotSpot(TM) 64-Bit Server VM (build 21.0.2+13-LTS-58, mixed mode, sharing)
Build tool: Gradle
# gradle -version
Gradle 8.7
Build time: 2024-03-22 15:52:46 UTC
Revision: 650af14d7653aa949fce5e886e685efc9cf97c10
Kotlin: 1.9.22
Groovy: 3.0.17
Ant: Apache Ant(TM) version 1.10.13 compiled on January 4 2023
JVM: 21.0.2 (Oracle Corporation 21.0.2+13-LTS-58)
OS: Mac OS X 14.4.1 aarch64
IDE: IntelliJ IDEA
Project Information
After creation, your project may look like this:
Add the dependencies; for the time being, jsoup (the Java HTML parser) will suffice:
// gradle => build.gradle => dependencies
implementation 'org.jsoup:jsoup:1.17.2'
Now it's time to scrape the website! Here I will crawl ScrapeMe as a reference; you can follow along with each step and then finish your own project.
First, let's take a look at the data we want to scrape from the Scrapeme site. Open the site in a browser and view the source code to analyze the target elements. Then, retrieve the desired elements through code.
ScrapeMe Homepage Product List:
ScrapeMe Product Details:
Our goal is to crawl the product information on the home page, including each product's name, image, price, and detail page address.
Page Elements
By inspecting the page elements, we can see that the element containing all products on the current page is ul.products, and each product's element is li.product.
Product Details
Further analysis shows where each field lives inside a product element: the product name is in a h2, the product image is the src attribute of a img, the product price is in a span, and the product detail address is the href attribute of a.
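Before writing the full scraper, you can sanity-check these selectors against a small HTML fragment. The fragment below is a simplified stand-in modeled on the ScrapeMe markup, not a copy of the live page:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class SelectorDemo {
    // Extracts {url, image, name, price} from one li.product element
    static String[] extract(Element product) {
        return new String[] {
                product.selectFirst("a").attr("href"),   // product detail url
                product.selectFirst("img").attr("src"),  // product image
                product.selectFirst("h2").text(),        // product name
                product.selectFirst("span").text()       // product price
        };
    }

    public static void main(String[] args) {
        // Simplified stand-in for one ScrapeMe product card (assumed markup)
        String html = "<ul class=\"products\"><li class=\"product\">"
                + "<a href=\"https://scrapeme.live/shop/Bulbasaur/\">"
                + "<img src=\"/img/001.png\">"
                + "<h2>Bulbasaur</h2>"
                + "<span>£63.00</span>"
                + "</a></li></ul>";
        Element product = Jsoup.parse(html).selectFirst("li.product");
        for (String field : extract(product)) {
            System.out.println(field);
        }
    }
}
```

If a selector is wrong, selectFirst returns null and the attr/text call fails immediately, so this kind of quick check catches mistakes before they reach the real scraper.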
ScrapeMeProduct.java
public class ScrapeMeProduct {
    /**
     * product detail url
     */
    private String url;
    /**
     * product image
     */
    private String image;
    /**
     * product name
     */
    private String name;
    /**
     * product price
     */
    private String price;

    // Getters and Setters
    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
    public String getImage() { return image; }
    public void setImage(String image) { this.image = image; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getPrice() { return price; }
    public void setPrice(String price) { this.price = price; }

    @Override
    public String toString() {
        return "{ \"url\":\"" + url + "\", "
                + " \"image\": \"" + image + "\", "
                + "\"name\":\"" + name + "\", "
                + "\"price\": \"" + price + "\" }";
    }
}
Scraper.java
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
public class Scraper {
    // scrapeme site url
    private static final String SCRAPEME_SITE_URL = "https://scrapeme.live/shop";

    public static List<ScrapeMeProduct> scrape() {
        // html doc for scrapeme page
        Document doc;
        // products data
        List<ScrapeMeProduct> pokemonProducts = new ArrayList<>();
        try {
            doc = Jsoup.connect(SCRAPEME_SITE_URL)
                    .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36") // mock userAgent header
                    .header("Accept-Language", "*") // mock Accept-Language header
                    .get();
            // select product nodes
            Elements products = doc.select("li.product");
            for (Element product : products) {
                ScrapeMeProduct pokemonProduct = new ScrapeMeProduct();
                pokemonProduct.setUrl(product.selectFirst("a").attr("href")); // parse and set product url
                pokemonProduct.setImage(product.selectFirst("img").attr("src")); // parse and set product image
                pokemonProduct.setName(product.selectFirst("h2").text()); // parse and set product name
                pokemonProduct.setPrice(product.selectFirst("span").text()); // parse and set product price
                pokemonProducts.add(pokemonProduct);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return pokemonProducts;
    }
}
Main.java
import io.xxx.basic.ScrapeMeProduct;
import io.xxx.basic.Scraper;
import java.util.List;
public class Main {
    public static void main(String[] args) {
        List<ScrapeMeProduct> products = Scraper.scrape();
        products.forEach(System.out::println);
        // continue coding
    }
}
So far we have learned how to use Java to crawl data from a simple static page. Next, we will build on this foundation in the advanced chapter: we will use Java to crawl all of ScrapeMe's product data concurrently, and we will connect Java code to the Nstbrowser browser for scraping, since the advanced chapter relies on Nstbrowser features such as its headless browser.
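As a preview of the concurrent step, one common pattern is to submit one task per shop page to an ExecutorService. This is only a sketch: the /page/N/ URL pattern and the page count are assumptions you should verify against the live site, and scrapePage is a hypothetical per-URL variant of the scrape() method above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentScraper {
    // Builds the paginated shop URLs; the /page/N/ pattern is an assumption
    static List<String> buildPageUrls(String base, int pages) {
        List<String> urls = new ArrayList<>();
        for (int i = 1; i <= pages; i++) {
            urls.add(base + "/page/" + i + "/");
        }
        return urls;
    }

    // Hypothetical per-URL scrape: fetches one page and returns its product names,
    // using the same jsoup pattern as Scraper above
    static List<String> scrapePage(String url) throws IOException {
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0") // mock userAgent header, as before
                .get();
        List<String> names = new ArrayList<>();
        doc.select("li.product h2").forEach(h2 -> names.add(h2.text()));
        return names;
    }

    // Submits one task per page and collects the results in order
    static List<List<String>> scrapeAll(List<String> urls)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Callable<List<String>>> tasks = new ArrayList<>();
        for (String url : urls) {
            tasks.add(() -> scrapePage(url));
        }
        List<List<String>> results = new ArrayList<>();
        for (Future<List<String>> future : pool.invokeAll(tasks)) {
            results.add(future.get());
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) {
        // Print the URLs that would be crawled; call scrapeAll(...) to hit the network
        buildPageUrls("https://scrapeme.live/shop", 4).forEach(System.out::println);
    }
}
```

invokeAll blocks until every page task finishes, so the results come back in the same order as the URL list even though the fetches overlap in time.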