Return String

Puppeteer: Page Functions

Updated 2020-07-21

Project Setup

Make sure you have created a project and installed Puppeteer by running npm install puppeteer from a terminal in your project folder. Part 1 of this series covers what Puppeteer is and how to set up a project using it.

The package.json file should look something like this:

{
    "name": "puppeteer-page-functions",
    "version": "1.0.0",
    "dependencies": {
        "puppeteer": "^5.0.0"
    }
}

Create a file named scraper.js in your project folder and add the code required to launch a browser and navigate to a website.

// Require the puppeteer library
const puppeteer = require('puppeteer');

// To use await we wrap our code in an async function
async function scrape() {
    // Create the browser without headless mode
    const browser = await puppeteer.launch({ headless: false });

    // Create a new page (tab) in the browser
    const page = await browser.newPage();

    // Navigate to a website
    await page.goto(
        'https://returnstring.com/articles/puppeteer-page-functions'
    );

    // CODE USING THE BROWSER HERE

    // Close the browser
    await browser.close();
}

// Run our function
scrape().catch(console.error);

Note that we are navigating to this article's page with page.goto for this tutorial.

Evaluating JavaScript

Evaluating JavaScript directly on the website in the browser is extremely useful. Puppeteer can send a function we provide to the page, execute it in the page's context, and return the result of our function to our Node context. This lets us execute code that interacts with the page, the DOM and other JavaScript loaded on the page.

We give page.evaluate a function, called a page function, as an argument to run on the page. Inside this function we are no longer writing code for Puppeteer running in Node; we are writing normal JavaScript that runs in the browser. That means we can use document.querySelector as well as most other browser APIs.

// This function will be executed inside the page
function myPageFunction() {
    // Find the first h1 on the page
    const h1Node = document.querySelector('h1');

    // Get the element's innerHTML
    const html = h1Node.innerHTML;

    // Return the innerHTML to the code running in Node
    return html;
}

// Run our page function inside the browser's page
// (await only works here because this code goes inside scrape())
const response = await page.evaluate(myPageFunction);

// Logging innerHTML to the terminal
console.log(response);

This returns the innerHTML of the first <h1> tag on the page and displays it in our terminal.
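One thing to watch for: document.querySelector returns null when nothing matches, so reading innerHTML from the result would throw inside the page. The following is a minimal sketch of a more defensive page function. The names safeHeadingHtml and readHeading are made up for illustration, and readHeading assumes it is given the page object created in scrape().

```javascript
// Hypothetical page function: returns null instead of throwing
// when the page has no <h1> element.
function safeHeadingHtml() {
    const h1Node = document.querySelector('h1');
    return h1Node ? h1Node.innerHTML : null;
}

// Hypothetical helper: `page` is assumed to be the Puppeteer page
// created with browser.newPage() in scrape().
async function readHeading(page) {
    const html = await page.evaluate(safeHeadingHtml);

    if (html === null) {
        console.log('No <h1> found on the page');
    }

    return html;
}
```

Inside scrape(), this could be used as const response = await readHeading(page); in place of the direct page.evaluate call.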

Dynamic Page Functions

Page functions can have parameters, whose values we pass to page.evaluate as additional arguments after the page function. This allows us to write generalized functions that operate on our input. All that is a fancy way of saying it works like a function! We can declare parameters on our page function; we just have to remember to pass them in.

// This function will be executed inside the page
function myPageFunction(selector) {
    // Find the first thing matching our selector on the page
    const elementNode = document.querySelector(selector);

    // Get the element's innerHTML
    const html = elementNode.innerHTML;

    // Return the innerHTML to the code running in Node
    return html;
}

// Run our page function looking for the first h1 (inside scrape())
const response = await page.evaluate(myPageFunction, 'h1');

// Logging innerHTML to the terminal
console.log(response);

This is the same page function as before, but instead of only finding the first <h1> tag on the page, it can look for any selector and return the innerHTML. Hint: If you replace h1 with article, you have copied this entire article.
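Parameters are also how values get from Node into the page. The page function is serialized and executed in the browser, so variables from the surrounding Node scope are not visible inside it. Arguments and return values cross that same boundary, so it is safest to pass and return plain data such as strings, numbers, arrays and plain objects rather than DOM nodes. Here is a rough sketch with two parameters; collectAttribute and readLinks are hypothetical names chosen for this example.

```javascript
// Hypothetical page function with two parameters: collects one attribute
// from every element matching the selector.
function collectAttribute(selector, attributeName) {
    // querySelectorAll returns a NodeList, so convert it to an array
    const nodes = Array.from(document.querySelectorAll(selector));

    // Return plain strings (or null when the attribute is absent), not DOM nodes
    return nodes.map((node) => node.getAttribute(attributeName));
}

// Hypothetical helper: `page` is assumed to be a Puppeteer page.
async function readLinks(page) {
    // These values live in Node and are only available
    // in the page because we pass them as arguments.
    const selector = 'a';
    const attributeName = 'href';

    return page.evaluate(collectAttribute, selector, attributeName);
}
```

The arguments arrive in the page function in the same order they are given to page.evaluate.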

The Code

This is the above example inserted into our file scraper.js:

// Require the puppeteer library
const puppeteer = require('puppeteer');

// This function will be executed inside the page
function myPageFunction(selector) {
    // Find the first thing matching our selector on the page
    const elementNode = document.querySelector(selector);

    // Get the element's innerHTML
    const html = elementNode.innerHTML;

    // Return the innerHTML to the code running in Node
    return html;
}

// To use await we wrap our code in an async function
async function scrape() {
    // Create the browser without headless mode
    const browser = await puppeteer.launch({ headless: false });

    // Create a new page (tab) in the browser
    const page = await browser.newPage();

    // Navigate to a website
    await page.goto(
        'https://returnstring.com/articles/puppeteer-page-functions'
    );

    // Run our page function looking for the first h1
    const response = await page.evaluate(myPageFunction, 'h1');

    // Logging innerHTML to the terminal
    console.log(response);

    // Close the browser
    await browser.close();
}

// Run our function
scrape().catch(console.error);
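Because myPageFunction takes the selector as a parameter, the same function can be reused for several selectors in one run. The following is a minimal sketch, assuming a hypothetical helper named scrapeSelectors that receives the page created in scrape(); it is one possible variation, not part of the original scraper.js.

```javascript
// Same page function as in scraper.js
function myPageFunction(selector) {
    const elementNode = document.querySelector(selector);
    return elementNode.innerHTML;
}

// Hypothetical helper: runs the page function once per selector and
// collects the results in an object keyed by selector.
// `page` is assumed to be a Puppeteer page.
async function scrapeSelectors(page, selectors) {
    const results = {};

    for (const selector of selectors) {
        results[selector] = await page.evaluate(myPageFunction, selector);
    }

    return results;
}
```

Inside scrape(), this could be called as await scrapeSelectors(page, ['h1', 'article']) in place of the single page.evaluate call.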

Conclusion

In this series you have learned the basics of how to use Puppeteer. Puppeteer is full of useful tools, smart ways of doing things and shortcuts; it is a wonderful tool to learn more about if you are looking to use it professionally.

However, with the basic skills you have learned so far, you can write amazing things yourself! With the important page.evaluate you can now do anything you can do with JavaScript in a browser. Go forth and automate your browser!

Next Steps

You now have the skills to automate the browser for almost any purpose: go create a scraping project with Puppeteer! If you would like inspiration, perhaps check out the tutorial on Making Stakeholders Happy With Daily Screenshots.

In the last part of this series, we will make some improvements to our setup by handling errors.