{ Web Scraping with Node.js. }

Objectives:

By the end of this chapter, you should be able to:

  • Define what web scraping is
  • Use cheerio to scrape data from a website

Web Scraping

Web scraping is the process of downloading and extracting data from a website. There are 3 main steps in scraping:

  1. Downloading the HTML document from a website (we will be doing this with the request module)
  2. Extracting data from the downloaded HTML (we will be doing this with cheerio)
  3. Doing something with the data (usually saving it somehow, e.g. by writing to a file with fs or saving to a database)
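Step 3 can be sketched on its own, without any scraping. The example below is a minimal sketch that writes scraped records to a JSON file with fs and reads them back. The `listings` data, the `saveListings` and `loadListings` helpers, and the temp-directory output path are all made up for illustration; a real scraper would produce the records in step 2.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical scraped records; in a real scraper these would come from step 2.
const listings = [
  { title: "Sunny 1BR near park", price: 2500 },
  { title: "Cozy studio", price: 1800 }
];

// Write the records to disk as pretty-printed JSON and return the path used.
function saveListings(records, filePath) {
  fs.writeFileSync(filePath, JSON.stringify(records, null, 2), "utf8");
  return filePath;
}

// Read the file back so we can confirm the data was saved intact.
function loadListings(filePath) {
  return JSON.parse(fs.readFileSync(filePath, "utf8"));
}

// Assumption: the system temp directory is a safe place for a demo file.
const outputPath = path.join(os.tmpdir(), "listings.json");
saveListings(listings, outputPath);
console.log(`Saved ${loadListings(outputPath).length} listings to ${outputPath}`);
```

JSON is a convenient format here because the saved file can be loaded straight back into JavaScript objects. For larger or longer-lived data sets, a database is usually the better fit.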

Typically, you would want to access the data through a website's API, but many websites don't provide one. When a website offers no programmatic way to download its data, web scraping is a practical way to get at it.

Robots.txt

Before you begin web scraping, it is best practice to read and honor a site's robots.txt file. A site may publish this file at its root (for example, /robots.txt), and its role is to tell programs (like our web scraper) which parts of the site they should and should not download. Here is Rithm School's robots.txt file. As you can see, it doesn't impose any restrictions. Compare that file to Craigslist's robots.txt file, which is much more restrictive about what a program may download.

You can find out more information about the robots.txt file here.
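To make the idea concrete, here is a rough sketch of how a scraper might check a path against a robots.txt file before downloading it. The `isAllowed` helper and the `sampleRobots` text are invented for this example (the sample is not any real site's file), and the check is deliberately simplified: it only reads Disallow rules in the "User-agent: *" group and ignores Allow rules, wildcards, and rules aimed at specific bots.

```javascript
// Return true when `path` is not blocked by a Disallow rule in the
// "User-agent: *" group. Simplification: Allow rules, wildcards, and
// groups for specific user agents are ignored.
function isAllowed(robotsTxt, path) {
  let appliesToUs = false;
  let lastWasUserAgent = false;
  const disallowed = [];

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    const separator = line.indexOf(":");
    if (separator === -1) continue; // blank or malformed line

    const field = line.slice(0, separator).trim().toLowerCase();
    const value = line.slice(separator + 1).trim();

    if (field === "user-agent") {
      // Consecutive User-agent lines share one group of rules.
      appliesToUs = lastWasUserAgent ? appliesToUs || value === "*" : value === "*";
      lastWasUserAgent = true;
    } else {
      if (field === "disallow" && appliesToUs && value !== "") {
        disallowed.push(value);
      }
      lastWasUserAgent = false;
    }
  }

  return !disallowed.some((prefix) => path.startsWith(prefix));
}

// Made-up robots.txt content for illustration only.
const sampleRobots = [
  "User-agent: *",
  "Disallow: /reply",
  "Disallow: /search/",
  "",
  "User-agent: somebot",
  "Disallow: /"
].join("\n");

console.log(isAllowed(sampleRobots, "/about"));      // true
console.log(isAllowed(sampleRobots, "/search/apa")); // false
```

A production scraper would more likely rely on a dedicated, well-tested robots.txt parser, since the real format has more rules than this sketch handles.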

Using cheerio

Cheerio is one of the many modules available in Node for web scraping, and it is among the easiest to get up and running with, especially if you know jQuery. The library implements a subset of core jQuery, so its functions for finding, traversing and manipulating elements will look familiar. However, cheerio does not download pages itself: you hand it an HTML string, which it parses into a document you can query. To retrieve that HTML, we need to make an HTTP request, and we will be using the request module to do that. Let's start with a simple application:

mkdir scraping_example && cd scraping_example
touch app.js
npm init -y
npm install --save cheerio request

Now in our app.js, let's scrape the first page of Craigslist:

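var cheerio = require("cheerio");
var request = require("request");

request('https://sfbay.craigslist.org/search/apa?bedrooms=1&bathrooms=1&availabilityMode=0', function(err, response, body){
    // stop early if the request itself failed
    if (err) {
        return console.error("Request failed:", err.message);
    }
    var $ = cheerio.load(body);
    // collect each listing price, stripping "$" and "," so that "$2,500" parses as 2500
    var prices = $(".result-price").map(function(i, el){
        return parseInt($(el).text().replace(/[^0-9]/g, ""), 10);
    }).get().filter(function(price){ return !isNaN(price); });
    if (prices.length === 0) {
        return console.error("No prices found; the page markup may have changed.");
    }
    // let's see the average price of 1 bedroom and bathroom in san francisco (based on 1 page of craigslist...)
    var avg = prices.reduce(function(acc, next){ return acc + next; }, 0) / prices.length;
    console.log(`Average 1 bedroom price: \$${avg.toFixed(2)}`);
});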

If you run node app.js in the terminal, it should print the average price of a one-bedroom apartment in the Bay Area, based on one page of results.
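The arithmetic in the scraper is easy to get subtly wrong, so it can help to pull it into small functions that run without a network request. The sketch below assumes prices arrive as whole-dollar text such as "$2,500". The `parsePrice` and `averagePrice` helpers and the `samplePrices` values are hypothetical, not part of cheerio or the scraped page.

```javascript
// parseInt stops at the first non-digit character, so a comma cuts the number short.
console.log(parseInt("2,500", 10)); // 2

// Convert price text such as "$2,500" to a number; return null when no digits are present.
// Assumption: prices are whole dollars, so every non-digit character can be dropped.
function parsePrice(text) {
  const digits = text.replace(/[^0-9]/g, "");
  return digits === "" ? null : parseInt(digits, 10);
}

// Average the parseable prices; return null for an empty list to avoid dividing by zero.
function averagePrice(priceTexts) {
  const prices = priceTexts.map(parsePrice).filter((price) => price !== null);
  if (prices.length === 0) return null;
  return prices.reduce((sum, price) => sum + price, 0) / prices.length;
}

// Made-up sample values, including one empty string that should be skipped.
const samplePrices = ["$2,500", "$1,800", "$3,200", ""];
console.log(averagePrice(samplePrices).toFixed(2)); // 2500.00
```

Keeping the parsing separate from the download also means the selector can change on the site without touching the math.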

Example Application

You can find a small example with cheerio here.

When you're ready, move on to Background Jobs with Kue
