How to Implement a Web Crawler with Node.js

Preface

This article introduces how to build a simple web crawler with Node.js, using the Douban Movies Top 250 list as an example. We will analyze the web page structure, send HTTP requests, parse the HTML content, and finally save the extracted information as a JSON file.

Web Crawler

A web crawler (also called a web spider), often simply called a crawler, is a program or script that automatically browses the Internet and collects information from it. Crawlers are usually used to gather data from specific websites for analysis, indexing, or other purposes. A crawler fetches web page content via HTTP requests, then parses that content to extract the required information, such as text, links, and images.

Simply put, the workflow of a web crawler usually includes the following steps:

  1. Determine entry URLs: Identify several initial web addresses as the starting point for crawling.
  2. Send HTTP requests: The crawler sends an HTTP request to the target website to request the page content.
  3. Get web page content: After the target website receives the request, it returns the corresponding HTML page.
  4. Parse web page content: The crawler uses parsing libraries (such as BeautifulSoup, Scrapy, Cheerio, etc.) to parse the web page content and extract the required data.
  5. Data processing: The crawler processes, cleans, stores or performs other operations on the extracted data.
  6. Loop execution: Repeat the above steps until the predetermined stop condition is met, such as scraping enough data, or reaching the preset depth or breadth limit.
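The six steps above can be sketched as a minimal crawl loop. The snippet below is a self-contained illustration: fetchPage is a stub over made-up example pages (no real network requests are made), and link extraction uses a naive regex instead of a proper HTML parser, so treat it as a shape of the algorithm rather than production code:

```javascript
// Made-up pages standing in for real websites
const pages = {
  'https://example.com/a': '<a href="https://example.com/b">b</a>',
  'https://example.com/b': '<a href="https://example.com/a">a</a>'
}
// Stub for step 2-3: in a real crawler this would be an HTTP request
async function fetchPage(url) { return pages[url] || '' }

async function crawl(entryUrls, maxPages) {
  const queue = [...entryUrls]                      // step 1: entry URLs
  const visited = new Set()
  while (queue.length && visited.size < maxPages) { // step 6: stop condition
    const url = queue.shift()
    if (visited.has(url)) continue
    const html = await fetchPage(url)               // steps 2-3: request + content
    visited.add(url)                                // step 5: process the page
    // step 4: parse out new links (naive regex for illustration only)
    const links = [...html.matchAll(/href="([^"]+)"/g)].map(m => m[1])
    queue.push(...links)
  }
  return [...visited]
}

crawl(['https://example.com/a'], 10).then(urls => console.log(urls.length)) // prints 2
```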

Web crawlers are widely used in search engines, data mining, monitoring, information aggregation and other fields. However, it is worth noting that when using web crawlers, you need to comply with the website's robots.txt protocol and relevant laws and regulations to ensure that the crawling behavior is legal and compliant.

Preparation Process

Step 1: Install VS Code (Download Visual Studio Code - Mac, Linux, Windows) and Node.js (Download | Node.js (nodejs.cn)). Create a movie-crawl folder in VS Code, then right-click the folder and select Open in Integrated Terminal to open a terminal there.

Step 2: Run npm init -y in the terminal. This initializes the folder as a Node.js project and generates the package.json project description file.


Step 3: Create the entry file main.js, outline the programming workflow, and you can start writing code.

Node.js Web Crawler

Crawling Idea: if the data is rendered on the target page, we can retrieve it. First, we send an HTTP request to https://movie.douban.com/top250 to get the HTML string. Then we parse the HTML string to extract the movie list. Finally, we assemble all movie objects into an array and write it out as a JSON file.

Code Implementation

  1. Import required modules: First, we import the required modules: request-promise is used to send HTTP requests, cheerio is used to parse HTML, fs is used for file operations, and util provides various utility functions.
// Import required modules
let request = require('request-promise') // Use the request-promise module for HTTP requests, needs to be installed first
let cheerio = require('cheerio') // Use the cheerio module for HTML parsing, needs to be installed first
let fs = require('fs') // Use the fs module for file operations
const util = require('util') // Use utility functions provided by the util module

Run npm i request-promise in the terminal to install the request-promise module, and npm i cheerio to install the cheerio module. npm i installs third-party packages; to install any other module, simply add its name after npm i.


  2. Movie information array: We define an empty array movies to store movie information.
// Array for storing movie information
let movies = []
  3. Base URL: We define the base URL of Douban Movies Top 250 as basicUrl.
// Base URL of Douban Movies Top 250
const basicUrl = 'https://movie.douban.com/top250'
  4. One-time execution function: The once function wraps a callback so that it runs at most once. The flag must live in a closure around the returned function; if it were reset on every call, the guard would never take effect.
// Wrap a callback so that it executes at most once
let once = function (cb) {
    let active = false
    return function (...args) {
        if (!active) {
            active = true
            cb(...args)
        }
    }
}
  5. Log function: The log function wraps console.log with once, so a piece of information is only output once.
// Log function, outputs a piece of information at most once
let log = once(function (item) {
    console.log(item)
})
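To see the at-most-once behavior in isolation, here is a standalone check of the closure-based once pattern (redefined locally so the snippet runs on its own):

```javascript
// A closure-based once: the wrapped callback fires only on the first call
function once(cb) {
    let active = false
    return (...args) => {
        if (!active) {
            active = true
            cb(...args)
        }
    }
}

let calls = 0
const init = once(() => calls++)
init()
init()
init()
console.log(calls) // prints 1
```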
  6. Movie information parsing function: The getMovieInfo function parses information from a single movie node.
// Function to parse movie information
function getMovieInfo(node) {
    let $ = cheerio.load(node) // Use cheerio to load the HTML node
    let titles = $('.info .hd span') // Get the movie title element
    titles = ([]).map.call(titles, t => { // Use the map method to convert title elements to an array of text
        return $(t).text()
    })
    let bd = $('.info .bd') // Get the movie information element
    let info = bd.find('p').text() // Get the movie introduction text
    let score = bd.find('.star .rating_num').text() // Get the movie score text
    return { titles, info, score } // Return the movie information object
}
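To make the output shape of getMovieInfo concrete without needing cheerio installed, the dependency-free snippet below runs regexes over a made-up fragment of the Top 250 item markup (the real page's markup differs in detail; the article's code should keep using cheerio selectors):

```javascript
// Made-up HTML fragment approximating one .item node on the Top 250 page
const item = `
<div class="item">
  <div class="info">
    <div class="hd"><span class="title">肖申克的救赎</span></div>
    <div class="bd">
      <p>导演: 弗兰克·德拉邦特 ...</p>
      <div class="star"><span class="rating_num">9.7</span></div>
    </div>
  </div>
</div>`

// Extract every title span, as getMovieInfo does with $('.info .hd span')
const titles = [...item.matchAll(/<span class="title">([^<]+)<\/span>/g)].map(m => m[1])
// Extract the rating, as getMovieInfo does with .find('.star .rating_num')
const score = (item.match(/class="rating_num">([^<]+)</) || [])[1]

console.log({ titles, score }) // { titles: [ '肖申克的救赎' ], score: '9.7' }
```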
  7. Function to get movie information for a page: The getPage function is used to get the movie information of a specific page.
// Get movie information for a specific page
async function getPage(url, num) {
    let html = await request({ // Use request-promise to send an HTTP request to get the page HTML
        url
    })
    console.log('Connection successful!', `Crawling data for page ${num + 1}`)
    let $ = cheerio.load(html) // Use cheerio to load the HTML page
    let movieNodes = $('#content .article .grid_view').find('.item') // Get movie nodes
    let movieList = ([]).map.call(movieNodes, node => { // Iterate over the list of movie nodes and parse movie information
        return getMovieInfo(node)
    })
    return movieList // Return the parsed list of movie information
}
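The Top 250 list is paginated with a start query parameter, 25 movies per page, so page i (0-based) lives at basicUrl + `?start=${25 * i}`. A tiny helper makes the URL construction explicit:

```javascript
const basicUrl = 'https://movie.douban.com/top250'
// Page i (0-based) starts at movie 25 * i
const pageUrl = i => `${basicUrl}?start=${25 * i}`

console.log(pageUrl(0)) // https://movie.douban.com/top250?start=0
console.log(pageUrl(9)) // https://movie.douban.com/top250?start=225
```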
  8. Main function: The main function is the main logic of the program, used to control the crawling process. It loops through each page to crawl movie information and writes the result to a JSON file.
// Main function, controls the crawling process
async function main() {
    let count = 10 // The Top 250 list spans 10 pages, 25 movies per page
    let list = []
    // Loop to crawl movie information from each page
    for (let i = 0; i < count; i++) {
        let url = basicUrl + `?start=${25 * i}` // Construct the URL for the current page
        list.push(...(await getPage(url, i))) // Get the current page's movie information and add it to the list
    }
    console.log(list.length) // Output the number of crawled movies
    // Write the crawled movie information to a JSON file
    fs.writeFile('./output.json', JSON.stringify(list), 'utf-8', () => {
        console.log('JSON file generated successfully!')
    })
}
  9. Execute the main function: Call the main function at the end to execute the entire crawling process.
main()

Run node main.js in the terminal to execute the crawler.

Result

Crawling completed successfully

Finally, the output.json file is generated

Summary

In this article, through step-by-step preparation and complete code implementation, we have successfully built a simple web crawler based on Node.js to retrieve information from Douban Movies Top 250. By sending HTTP requests and parsing HTML content, we obtained movie information including titles, introductions, and scores, and stored this information as a JSON file. This example demonstrates the basic working principle of web crawlers, and also provides hands-on practice with Node.js, request-promise, cheerio and other related modules.


This is a discussion topic separated from the original post at https://juejin.cn/post/7369238146733998089