Return String

Puppeteer: Initial Setup

Updated 2020-07-21

What is Puppeteer

Puppeteer is a Node library that lets you control a browser through code. It is fast, easy to use, and comes with a browser bundled, so you can start testing or scraping websites with minimal setup. That minimal setup is what this article will cover.

Project Setup

Let us start our journey by creating a folder for our project, and then a new file named scraper.js in our project folder. This is where we will write all of our code.

Make sure you have Node and NPM installed. You can verify this by running node -v and npm -v in a terminal. Your Node version must be 14.3.0 or greater.
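If you prefer to check the version from code rather than by eye, the comparison can be sketched as below. The isVersionAtLeast helper is a name made up for this example; process.versions.node is the version string Node itself exposes.

```javascript
// Compare two dotted version strings such as "14.3.0" and "16.20.2".
// Returns true when `current` is equal to or newer than `required`.
// isVersionAtLeast is a hypothetical helper, not part of Node or Puppeteer.
function isVersionAtLeast(current, required) {
    const a = current.split('.').map(Number);
    const b = required.split('.').map(Number);
    for (let i = 0; i < Math.max(a.length, b.length); i++) {
        const x = a[i] || 0;
        const y = b[i] || 0;
        // The first differing part decides the result
        if (x !== y) return x > y;
    }
    return true;
}

// process.versions.node holds the running Node version without the leading "v"
const nodeIsRecentEnough = isVersionAtLeast(process.versions.node, '14.3.0');
console.log(nodeIsRecentEnough
    ? 'Node version is recent enough'
    : 'Please upgrade Node to 14.3.0 or newer');
```

Comparing the parts as numbers matters here: a plain string comparison would rank "9.0.0" above "14.3.0".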

Run npm init -y from a terminal in our project folder. This will create a new NPM project with the default values. Feel free to modify your package.json with your own values, or copy the one shown in this article.

To install Puppeteer, run npm install puppeteer from a terminal in your project folder. This will download Chromium as well as Puppeteer to your node_modules folder: everything needed to get started automating browser actions!

By the end of these steps your package.json file should look something like this:

{
    "name": "puppeteer-initial-setup",
    "version": "1.0.0",
    "dependencies": {
        "puppeteer": "^5.0.0"
    }
}

Using Puppeteer

Setup

To start programming Puppeteer to do our bidding, let us open the scraper.js file we created in whichever editor you prefer.

First, we will have to load Puppeteer using a CommonJS require.

const puppeteer = require('puppeteer');

To use the ES2017 await keyword, we will have to wrap our scraping code in an asynchronous function and then call it. Our scaffolding looks something like this:

async function scrape() {
    // CODE GOES HERE
}

// Run our function
scrape().catch(console.error);

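Adding .catch(console.error) makes sure that any error thrown inside scrape is printed to the console instead of surfacing as an unhandled promise rejection, which keeps error messages easier to read. A full explanation can be found in the last article in this series.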
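To see the error path in isolation, here is a small sketch that needs no browser. failingScrape and report are stand-ins invented for this example; the point is only that an error thrown inside an async function ends up in the handler passed to .catch.

```javascript
// A stand-in for scrape() that always fails, used only to show the error path
async function failingScrape() {
    throw new Error('navigation failed');
}

// Hypothetical handler: remembers the message and prints the error,
// much like passing console.error directly would
const seenMessages = [];
function report(error) {
    seenMessages.push(error.message);
    console.error(error);
}

// The thrown error rejects the promise returned by failingScrape(),
// and .catch hands that error to our handler
const finished = failingScrape().catch(report);
```

Passing console.error directly, as the article does, is the shortest form of the same idea.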

Launching a Browser

To start using Puppeteer to control Chromium, we will first have to launch a new browser instance. Inside our scrape function add the following code to launch a browser:

const browser = await puppeteer.launch();

The await keyword pauses our function until the promise returned by puppeteer.launch() resolves. In this case, the code waits for the actual browser to launch before executing the next line.

With a browser launched, we now either have to grab the default tab (called a page in Puppeteer) or create a new one. The easiest solution is to simply create a new page with browser.newPage() and await its creation.

const page = await browser.newPage();
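For the other option, grabbing the default tab, a minimal sketch could look like this. It relies on browser.pages(), which resolves to an array of the pages currently open; the getPage helper itself is a name made up for this example.

```javascript
// Hypothetical helper: reuse the tab the browser opened at launch,
// falling back to a new tab if none exists.
// `browser` is expected to be the object returned by puppeteer.launch().
async function getPage(browser) {
    // browser.pages() resolves to an array of all open pages
    const pages = await browser.pages();
    return pages.length > 0 ? pages[0] : browser.newPage();
}

// Usage inside scrape():
//     const page = await getPage(browser);
```

Reusing the first tab avoids leaving an empty tab open next to the one we work in.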

Navigating to Websites

What good is opening a browser without going to a website? Even though this is just an initial setup tutorial, let us navigate to a website so you can see how easy it is to use Puppeteer.

await page.goto('https://returnstring.com');

Puppeteer provides an easy way to navigate to a website with page.goto(URL). Simply provide it a URL and Puppeteer will navigate there and wait for the page to load, just as if you had typed it into the address bar of the browser.
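How long page.goto waits can be tuned with a second argument. The sketch below wraps the call in a hypothetical navigate helper; waitUntil and timeout are options of page.goto, while the values chosen here are only examples.

```javascript
// Hypothetical helper: `page` is a page created with browser.newPage()
async function navigate(page, url) {
    const response = await page.goto(url, {
        // Treat navigation as finished once at most 2 network
        // connections have been open for at least 500 ms
        waitUntil: 'networkidle2',
        // Give up after 30 seconds
        timeout: 30000,
    });
    // page.goto resolves to the main response, which can be null
    return response ? response.status() : null;
}

// Usage inside scrape():
//     const status = await navigate(page, 'https://returnstring.com');
```

Returning the HTTP status gives the caller a quick way to notice a 404 or 500 before scraping further.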

Closing the Browser

At the end of our function we should make sure the browser closes properly. To do that, we always end the function by calling browser.close().

await browser.close();
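One caveat: if an earlier line throws, the call to browser.close() at the end is never reached and the browser keeps running. A common way around this is try/finally. The sketch below assumes a hypothetical withBrowser helper that receives the launch function as an argument.

```javascript
// Hypothetical helper: `launch` is any function that resolves to a browser,
// for example () => puppeteer.launch()
async function withBrowser(launch, task) {
    const browser = await launch();
    try {
        // Run the caller's code with the launched browser
        return await task(browser);
    } finally {
        // Runs whether the task succeeded or threw
        await browser.close();
    }
}

// Usage, assuming puppeteer has been required as shown earlier:
//     withBrowser(() => puppeteer.launch(), async (browser) => {
//         const page = await browser.newPage();
//         await page.goto('https://returnstring.com');
//     }).catch(console.error);
```

The error still reaches .catch(console.error), but only after the browser has been closed.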

The Code So Far

// Require the puppeteer library
const puppeteer = require('puppeteer');

// To use await we wrap our code in an async function
async function scrape() {
    // Launch a new browser instance
    const browser = await puppeteer.launch();

    // Create a new page (tab) in the browser
    const page = await browser.newPage();

    // Navigate to a website
    await page.goto('https://returnstring.com');

    // Close the browser
    await browser.close();
}

// Run our function
scrape().catch(console.error);

With everything wired up, we can do almost everything a normal user can do by interacting with our browser and page objects. Next let us look at running our code with Node.

Running Our Code

Headless

Puppeteer is headless by default. That means Puppeteer can run a browser without rendering anything to a screen. Running Puppeteer headless is fast, but it makes debugging your code needlessly hard while you are developing it. To run Puppeteer without headless mode, we can give puppeteer.launch an options object as an argument.

const browser = await puppeteer.launch({ headless: false });

Adding an options object with headless: false makes Puppeteer launch the browser visibly so you can follow along on your screen.
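Rather than editing the code every time you switch modes, the option can be derived from an environment variable. This is only a sketch: HEADLESS is a variable name chosen for this example and is not something Puppeteer reads on its own, and launchOptions is a hypothetical helper.

```javascript
// Hypothetical helper: build the launch options from environment variables.
// Headless stays the default; only HEADLESS=false turns it off.
function launchOptions(env) {
    return { headless: env.HEADLESS !== 'false' };
}

// Show which options the current environment would produce
console.log(launchOptions(process.env));

// Usage inside scrape():
//     const browser = await puppeteer.launch(launchOptions(process.env));
```

On macOS or Linux, running HEADLESS=false node scraper.js would then show the browser window, while plain node scraper.js stays headless.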

Using Node

To execute our scraper with Node, run node scraper.js from a terminal in our project folder. If you added the headless: false option, you should see a window pop up and close again.

Waiting With Puppeteer

If you have trouble seeing the window pop up, you can make Puppeteer wait for a given number of milliseconds by calling page.waitFor.

await page.waitFor(2000);

Adding this before we call browser.close will keep the browser open for 2 seconds before it closes.
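Later Puppeteer releases deprecate page.waitFor, so if the call is unavailable or prints a warning in the version you have installed, a plain promise-based pause does the same job. The delay helper below is our own function, not part of Puppeteer.

```javascript
// Hypothetical helper: resolve after the given number of milliseconds
function delay(ms) {
    return new Promise((resolve) => setTimeout(resolve, ms));
}

// Usage inside scrape(), in place of page.waitFor(2000):
//     await delay(2000);
```

Because it uses only setTimeout, this works the same regardless of the Puppeteer version.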

The Code

// Require the puppeteer library
const puppeteer = require('puppeteer');

// To use await we wrap our code in an async function
async function scrape() {
    // Create the browser without headless mode
    const browser = await puppeteer.launch({ headless: false });

    // Create a new page (really just a tab) in the Chromium browser
    const page = await browser.newPage();

    // Navigate to a website
    await page.goto('https://returnstring.com');

    // Wait for 2 seconds
    await page.waitFor(2000);

    // Close the browser
    await browser.close();
}

// Run our function
scrape().catch(console.error);

Next Steps

You have successfully created a project, launched a browser, navigated to a website, and now have a function ready for expansion!

To learn how to interact with the browser, take screenshots and extract values from the page, continue this series with the next article: Puppeteer: Taking Screenshots