This guide focuses on web scraping with Java in 2024. It provides a step-by-step tutorial on how to extract data from websites using the Scrapeme site as an example. By utilizing Java and the jsoup library, you will learn how to scrape static web page resources and retrieve specific information such as product names, images, prices, and details.
This guide equips you with the necessary skills to scrape data from similar websites and serves as a foundation for more advanced web scraping techniques.
Get ready to harness the power of Java for web scraping in this modern era!
Since we will be using Java for the demo project in this article, make sure you have the following prerequisites in place before proceeding:
Note: The installation process for the environment is omitted.
JDK 21
# java -version
java version "21.0.2" 2024-01-16 LTS
Java(TM) SE Runtime Environment (build 21.0.2+13-LTS-58)
Java HotSpot(TM) 64-Bit Server VM (build 21.0.2+13-LTS-58, mixed mode, sharing)
Build tool: Gradle
# gradle -version
Gradle 8.7
Build time: 2024-03-22 15:52:46 UTC
Revision: 650af14d7653aa949fce5e886e685efc9cf97c10
Kotlin: 1.9.22
Groovy: 3.0.17
Ant: Apache Ant(TM) version 1.10.13 compiled on January 4 2023
JVM: 21.0.2 (Oracle Corporation 21.0.2+13-LTS-58)
OS: Mac OS X 14.4.1 aarch64
IDE: IntelliJ IDEA
Project Information
After creation, your project may look like this:
Add the dependencies; for the time being, jsoup (the Java HTML parser) will suffice:
// gradle => build.gradle => dependencies
implementation 'org.jsoup:jsoup:1.17.2'
Now it's time to scrape the website! Here I will crawl ScrapeMe as a reference; you can follow along with each step and then finish your own project.
First, let's take a look at the data we want to scrape from the Scrapeme site. Open the site in a browser and view the source code to analyze the target elements. Then, retrieve the desired elements through code.
ScrapeMe Homepage Product List:
ScrapeMe Product Details:
Our goal is to crawl the product information on the home page, including each product's name, image, price, and detail page address.
Page Elements
By inspecting the page elements, we can see that the element containing all products on the current page is ul.products, and each product's element is li.product.
Product Details
Further analysis shows where each field lives inside a product element: the product name is in a h2, the product image is the src attribute of a img, the product price is in a span, and the product detail address is the href attribute of a.
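Before writing the full scraper, you can sanity-check these selectors against a small HTML fragment. The fragment below is a simplified stand-in modeled on the ScrapeMe markup, not a copy of the live page:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class SelectorDemo {
    // Extracts {url, image, name, price} from one li.product element
    static String[] extract(Element product) {
        return new String[] {
                product.selectFirst("a").attr("href"),   // product detail url
                product.selectFirst("img").attr("src"),  // product image
                product.selectFirst("h2").text(),        // product name
                product.selectFirst("span").text()       // product price
        };
    }

    public static void main(String[] args) {
        // Simplified stand-in for one ScrapeMe product card (assumed markup)
        String html = "<ul class=\"products\"><li class=\"product\">"
                + "<a href=\"https://scrapeme.live/shop/Bulbasaur/\">"
                + "<img src=\"/img/001.png\">"
                + "<h2>Bulbasaur</h2>"
                + "<span>£63.00</span>"
                + "</a></li></ul>";
        Element product = Jsoup.parse(html).selectFirst("li.product");
        for (String field : extract(product)) {
            System.out.println(field);
        }
    }
}
```

If a selector is wrong, selectFirst returns null and the attr/text call fails immediately, so this kind of quick check catches mistakes before they reach the real scraper.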
ScrapeMeProduct.java
public class ScrapeMeProduct {
    /**
     * product detail url
     */
    private String url;
    /**
     * product image
     */
    private String image;
    /**
     * product name
     */
    private String name;
    /**
     * product price
     */
    private String price;

    // Getters and Setters
    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
    public String getImage() { return image; }
    public void setImage(String image) { this.image = image; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getPrice() { return price; }
    public void setPrice(String price) { this.price = price; }

    @Override
    public String toString() {
        return "{ \"url\":\"" + url + "\", "
                + " \"image\": \"" + image + "\", "
                + "\"name\":\"" + name + "\", "
                + "\"price\": \"" + price + "\" }";
    }
}
Scraper.java
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
public class Scraper {
    // scrapeme site url
    private static final String SCRAPEME_SITE_URL = "https://scrapeme.live/shop";

    public static List<ScrapeMeProduct> scrape() {
        // html doc for scrapeme page
        Document doc;
        // products data
        List<ScrapeMeProduct> pokemonProducts = new ArrayList<>();
        try {
            doc = Jsoup.connect(SCRAPEME_SITE_URL)
                    .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36") // mock userAgent header
                    .header("Accept-Language", "*") // mock Accept-Language header
                    .get();
            // select product nodes
            Elements products = doc.select("li.product");
            for (Element product : products) {
                ScrapeMeProduct pokemonProduct = new ScrapeMeProduct();
                pokemonProduct.setUrl(product.selectFirst("a").attr("href")); // parse and set product url
                pokemonProduct.setImage(product.selectFirst("img").attr("src")); // parse and set product image
                pokemonProduct.setName(product.selectFirst("h2").text()); // parse and set product name
                pokemonProduct.setPrice(product.selectFirst("span").text()); // parse and set product price
                pokemonProducts.add(pokemonProduct);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return pokemonProducts;
    }
}
Main.java
import io.xxx.basic.ScrapeMeProduct;
import io.xxx.basic.Scraper;
import java.util.List;
public class Main {
    public static void main(String[] args) {
        List<ScrapeMeProduct> products = Scraper.scrape();
        products.forEach(System.out::println);
        // continue coding
    }
}
So far we have learned how to use Java to crawl data from a simple static page. Next, we will build on this foundation in the advanced chapter: we will use Java to crawl all of ScrapeMe's product data concurrently, and we will connect Java code to the Nstbrowser browser for scraping, since the advanced chapter relies on Nstbrowser features such as its headless browser.
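As a preview of the concurrent step, one common pattern is to submit one task per shop page to an ExecutorService. This is only a sketch: the /page/N/ URL pattern and the page count are assumptions you should verify against the live site, and scrapePage is a hypothetical per-URL variant of the scrape() method above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentScraper {
    // Builds the paginated shop URLs; the /page/N/ pattern is an assumption
    static List<String> buildPageUrls(String base, int pages) {
        List<String> urls = new ArrayList<>();
        for (int i = 1; i <= pages; i++) {
            urls.add(base + "/page/" + i + "/");
        }
        return urls;
    }

    // Hypothetical per-URL scrape: fetches one page and returns its product names,
    // using the same jsoup pattern as Scraper above
    static List<String> scrapePage(String url) throws IOException {
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0") // mock userAgent header, as before
                .get();
        List<String> names = new ArrayList<>();
        doc.select("li.product h2").forEach(h2 -> names.add(h2.text()));
        return names;
    }

    // Submits one task per page and collects the results in order
    static List<List<String>> scrapeAll(List<String> urls)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Callable<List<String>>> tasks = new ArrayList<>();
        for (String url : urls) {
            tasks.add(() -> scrapePage(url));
        }
        List<List<String>> results = new ArrayList<>();
        for (Future<List<String>> future : pool.invokeAll(tasks)) {
            results.add(future.get());
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) {
        // Print the URLs that would be crawled; call scrapeAll(...) to hit the network
        buildPageUrls("https://scrapeme.live/shop", 4).forEach(System.out::println);
    }
}
```

invokeAll blocks until every page task finishes, so the results come back in the same order as the URL list even though the fetches overlap in time.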