Headless Scraping with Cheerio
March 21, 2022
web scraper, cheerio, web scraping, headless
Cheerio is a Node.js package that allows you to easily parse and extract elements from markup. Unlike a browser, Cheerio doesn't produce a visual rendering, load external resources, execute JavaScript, or apply CSS. As a result, it's much faster than browser-based solutions when it comes to scraping web pages for valuable data.
In this article, we’re going to specifically cover how to set up a headless scraper with Cheerio and spoof browser headers to access a wider range of web pages.
Requirements
To start with, you'll need at least basic knowledge of jQuery and a Node.js package manager such as npm to install and use Cheerio. To set up your scraper, you'll benefit from a working knowledge of Chrome's Developer Tools, JavaScript, and the DOM. Furthermore, you'll want to be familiar with an HTTP request library like Axios to simplify your scraper development.
What to Expect
Many sites on the web require specific header information before they'll deliver the full HTML to you. Typically, your browser passes these headers along for you, but with a headless scraper you must pass them along programmatically to receive data from the website. We're going to show how you can use Axios to make an HTTP request with spoofed headers to these types of websites, and then use Cheerio to isolate the specific data you need from the webpage.
Pre-Scraping Browser Analysis
Before building a scraper, you'll typically want to use a common browser like Chrome to pick out critical HTML and header information from the web page you want to scrape. The browser we're using in this article is Chrome, and the example website we're going to scrape is https://www.scrapethissite.com/pages/advanced/?gotcha=headers. This website is useful because it specifically requires headers to access the text data on the page, which makes it a good testbed for debugging your own headless scraper.
Picking out the HTML Class
Typically, you won’t need all the HTML information from a web page. Rather, you’ll only need relevant pieces of data. To best take advantage of Cheerio’s ability to parse HTML, you’ll usually want to go into developer tools on your webpage and isolate the specific elements you need.
In this case, we want the text snippet that confirms the headers are correct. Within Developer Tools, we can see two classes associated with it: "col-md-4" and "col-md-offset-4". To simplify our query, we'll narrow this down to "col-md-4". Later, we'll show how you can use Cheerio to extract only the elements in this class.
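To see how that class selection works in isolation, here's a minimal sketch. The markup below is a hypothetical stand-in for what Developer Tools shows on the page; only the class names come from the real site.

const cheerio = require('cheerio');

// Hypothetical markup approximating what Developer Tools shows; only the
// class names are taken from the real page.
const html = `
  <div class="row">
    <div class="col-md-4 col-md-offset-4">
      <p>Header confirmation text goes here.</p>
    </div>
  </div>`;

const $ = cheerio.load(html);

// '.col-md-4' selects by class, exactly as it would in jQuery.
console.log($('.col-md-4').html()); // prints the inner <p> element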
Specifying Headers
There are two ways to see what headers your browser is sending the web server. The easiest is to visit a header-echoing site like headers.cf in your browser. The second is to watch the Network tab in Developer Tools as the page loads and inspect the request headers there.
No matter which method you use, you’ll want to record these headers since you’ll later add them to your HTTP request in your scraper.
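For example, you might record them in a plain object so they're ready to drop into an HTTP request later. The values below are the Chrome-on-macOS headers we use later in this article; yours will differ depending on your browser and platform.

// Headers recorded from the browser. These values are from Chrome on macOS;
// substitute whatever your own browser sends.
const browserHeaders = {
  "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
  "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36",
};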
Building and Testing a Headless Scraper
Project Set-Up
You'll want to set up a new project by initializing a package with npm and installing both Cheerio and Axios.
npm init -y; npm i cheerio axios
Create your scraper file in the main project folder and start by requiring your two packages:
const axios = require('axios');
const cheerio = require('cheerio');
We’ll create a single asynchronous function for our Axios and Cheerio calls. Within this function, we’ll begin with an Axios GET request to our webpage:
const axios = require('axios');
const cheerio = require('cheerio');
(async () => {
  try {
    const response = await axios.get('https://www.scrapethissite.com/pages/advanced/?gotcha=headers');
  } catch (error) {
    console.log(error);
  }
})();
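At this point, you can already sanity-check that the request succeeds. Axios exposes the HTTP status code on response.status, so a quick log inside the try block, just after the GET request, confirms the server is responding:

// Inside the try block, right after the GET request: Axios puts the HTTP
// status code on response.status, so a 200 here means the request succeeded.
console.log(response.status);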
To parse and read the response from our webpage, we'll next use Cheerio. First, we load in the HTML response from Axios. We're only using the data object from the response schema, so we pass response.data to Cheerio's load function. Afterwards, we use Cheerio's selector function $() with the '.col-md-4' class to grab the element we identified earlier, then call .html() to print its inner HTML. Keep in mind that class isn't the only way to select elements; Cheerio supports a wide range of CSS selectors, which you can read more about in its documentation.
const axios = require('axios');
const cheerio = require('cheerio');
(async () => {
  try {
    const response = await axios.get('https://www.scrapethissite.com/pages/advanced/?gotcha=headers');
    const $ = cheerio.load(response.data);
    console.log($('.col-md-4').html());
  } catch (error) {
    console.log(error);
  }
})();
With our basic scraper built, we'll go ahead and run it and look at the output. The response tells us this specific webpage requires us to mimic the 'User-Agent' and 'Accept' headers in order to receive the data we want.
It's important to recognize that most web pages won't be so explicit about why you can't access data. If your scraper can reach the page but can't pull the same data you saw in the browser, you should check for header requirements even if there's no relevant message about headers.
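One quick programmatic check is to test whether your selector matched anything at all. Here's a minimal sketch of that idea, reusing the scraper built so far; an empty match is a hint that the server sent the scraper different HTML than it sent the browser.

const axios = require('axios');
const cheerio = require('cheerio');

(async () => {
  try {
    const response = await axios.get('https://www.scrapethissite.com/pages/advanced/?gotcha=headers');
    const $ = cheerio.load(response.data);
    const selection = $('.col-md-4');
    // An empty match often means the server served the scraper different
    // HTML than it served the browser - a hint to compare request headers.
    if (selection.length === 0) {
      console.warn('Selector matched nothing - check your request headers.');
    } else {
      console.log(selection.html());
    }
  } catch (error) {
    console.log(error);
  }
})();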
We know we’ll have to spoof the headers, so let’s go back to our previous header list and grab the ‘User-Agent’ and ‘Accept’ header values. We’ll add these to our Axios GET request within the headers property.
const axios = require('axios');
const cheerio = require('cheerio');
(async () => {
  try {
    const response = await axios.get('https://www.scrapethissite.com/pages/advanced/?gotcha=headers', {
      headers: {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36",
      },
    });
    const $ = cheerio.load(response.data);
    console.log($('.col-md-4').html());
  } catch (error) {
    console.log(error);
  }
})();
This successfully spoofs the headers, so our scraper's console output matches the text we saw in the browser.
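If you only want the visible text rather than the inner HTML, Cheerio also provides a .text() method. Swapping the console.log line in the block above for the following prints just the confirmation message, trimmed of surrounding whitespace:

// Replaces the console.log line above: .text() strips the tags and returns
// only the rendered text, and trim() removes surrounding whitespace.
console.log($('.col-md-4').text().trim());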
Conclusion
In this article, we covered how to build and test a headless scraper to access web pages that would otherwise require a browser. We leveraged the Cheerio and Axios packages to help build our scraper, parse the web page’s HTML, and extract the specific data we wanted. There are two important things to remember when you build your own headless scraper:
- Oftentimes, the web page will not explicitly tell you which headers your scraper is missing and, instead, will return incorrect or null data. Remember to check headers from your browser first, and then add the headers to your HTTP request to avoid any issues.
- Adding headers will work in many cases but is sometimes insufficient for completely mimicking a browser, particularly when the server runs bot detection at the network level. Properties like the order of the headers and the TLS fingerprint can reveal that your scraper is not a browser despite acceptable header values.
With that said, setting up a headless scraper with Axios and Cheerio is a relatively simple process that will work for many scraping jobs. Feel free to use this example as a blueprint for building your own headless scraper!