Puppeteer: Page Functions
Updated 2020-07-21
Project Setup
Make sure you have created a project and installed Puppeteer by running npm install puppeteer
from a terminal in your project folder. Part 1 of this series goes over what Puppeteer is and how to set up a project using it.
The package.json
file should look something like this:
{
  "name": "puppeteer-page-functions",
  "version": "1.0.0",
  "dependencies": {
    "puppeteer": "^5.0.0"
  }
}
Create a file named scraper.js
in your project folder and add the code required to launch a browser and navigate to a website.
// Require the puppeteer library
const puppeteer = require('puppeteer');

// To use await we wrap our code in an async function
async function scrape() {
  // Create the browser without headless mode
  const browser = await puppeteer.launch({ headless: false });
  // Create a new page (tab) in the browser
  const page = await browser.newPage();
  // Navigate to a website
  await page.goto(
    'https://returnstring.com/articles/puppeteer-page-functions'
  );

  // CODE USING THE BROWSER HERE

  // Close the browser
  await browser.close();
}

// Run our function
scrape().catch(console.error);
Note that we are navigating to this article's page with page.goto
for this tutorial.
Evaluating JavaScript
Evaluating JavaScript directly on the website in the browser is extremely useful. Puppeteer can send a function we provide into the page, execute it in the page's context, and return the result of our function to our Node context. This allows us to execute code that interacts with the page, the DOM, and other JavaScript loaded on the page.
Using page.evaluate
we can pass a function, called a page function, to run on the page. Inside this function we are no longer writing code for Puppeteer running in Node; we are writing normal JavaScript that runs in the browser. That means we can use document.querySelector
as well as most other browser APIs.
// This function will be executed inside the page
function myPageFunction() {
  // Find the first h1 on the page
  const h1Node = document.querySelector('h1');
  // Get the element's innerHTML
  const html = h1Node.innerHTML;
  // Return the innerHTML to the code running in Node
  return html;
}

// Run our page function inside the browser's page
const response = await page.evaluate(myPageFunction);
// Logging innerHTML to the terminal
console.log(response);
This will return the innerHTML
of the first <h1>
tag on the page and display it in our terminal.
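A page function is not limited to returning a single string. The return value is serialized on its way back to Node, so it is safest to return plain data such as strings, numbers, arrays, and plain objects rather than DOM nodes. The sketch below illustrates this under a few assumptions: the names collectPageSummary and logPageSummary are hypothetical, and page is assumed to be the Puppeteer page created in scrape().

```javascript
// Page function: runs inside the browser, so `document` is available here.
// (collectPageSummary is a hypothetical name used for illustration.)
function collectPageSummary() {
  const h1Node = document.querySelector('h1');
  // Return only plain, serializable data (strings, numbers, null)
  return {
    title: document.title,
    heading: h1Node ? h1Node.textContent.trim() : null,
    paragraphCount: document.querySelectorAll('p').length,
  };
}

// Node side: `page` is assumed to be the Puppeteer page created in scrape().
async function logPageSummary(page) {
  // The object returned in the browser arrives here as a plain object
  const summary = await page.evaluate(collectPageSummary);
  console.log(summary);
  return summary;
}
```

To try it, call await logPageSummary(page) where the scrape() function has its "CODE USING THE BROWSER HERE" comment.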
Dynamic Page Functions
Page functions can have parameters, whose values we pass as additional arguments after the page function in the call to page.evaluate
. This lets us write generalized functions that operate on our input. In other words, a page function works like any other function: we declare its parameters and just have to remember to pass the arguments in.
// This function will be executed inside the page
function myPageFunction(selector) {
  // Find the first thing matching our selector on the page
  const elementNode = document.querySelector(selector);
  // Get the element's innerHTML
  const html = elementNode.innerHTML;
  // Return the innerHTML to the code running in Node
  return html;
}

// Run our page function looking for the first h1
const response = await page.evaluate(myPageFunction, 'h1');
// Logging innerHTML to the terminal
console.log(response);
This is the same page function as before, but instead of only finding the first <h1>
tag on the page, it can look for any selector and return the innerHTML
. Hint: If you replace h1
with article
, you have copied this entire article.
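Because the page function is executed in the browser, it cannot see variables from the surrounding Node code; anything it needs has to be passed in as an argument. The following sketch shows a page function with two parameters that also returns null when nothing matches, instead of throwing. The names readAttribute and logFirstLink are hypothetical, and page is again assumed to be the Puppeteer page from scrape().

```javascript
// Page function with two parameters; both values are passed in from Node.
// (readAttribute is a hypothetical name used for illustration.)
function readAttribute(selector, attributeName) {
  const elementNode = document.querySelector(selector);
  // Return null instead of throwing when nothing matches the selector
  if (!elementNode) {
    return null;
  }
  return elementNode.getAttribute(attributeName);
}

// Node side: `page` is assumed to be the Puppeteer page created in scrape().
async function logFirstLink(page) {
  // Arguments after the page function are handed to it in the same order
  const href = await page.evaluate(readAttribute, 'a', 'href');
  console.log(href);
  return href;
}
```

The null check is a design choice rather than a requirement: without it, a selector that matches nothing makes the page function throw inside the browser, and the promise returned by page.evaluate is rejected.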
The Code
This is the above example inserted into our file scraper.js:
// Require the puppeteer library
const puppeteer = require('puppeteer');

// This function will be executed inside the page
function myPageFunction(selector) {
  // Find the first thing matching our selector on the page
  const elementNode = document.querySelector(selector);
  // Get the element's innerHTML
  const html = elementNode.innerHTML;
  // Return the innerHTML to the code running in Node
  return html;
}

// To use await we wrap our code in an async function
async function scrape() {
  // Create the browser without headless mode
  const browser = await puppeteer.launch({ headless: false });
  // Create a new page (tab) in the browser
  const page = await browser.newPage();
  // Navigate to a website
  await page.goto(
    'https://returnstring.com/articles/puppeteer-page-functions'
  );

  // Run our page function looking for the first h1
  const response = await page.evaluate(myPageFunction, 'h1');
  // Logging innerHTML to the terminal
  console.log(response);

  // Close the browser
  await browser.close();
}

// Run our function
scrape().catch(console.error);
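As a possible variation on the script above, a page function can collect every match rather than only the first one. This is a minimal sketch, not part of the original script: collectText and logAllText are hypothetical names, and page is assumed to be the Puppeteer page created in scrape().

```javascript
// Page function: gather the text of every element matching the selector.
// (collectText is a hypothetical name used for illustration.)
function collectText(selector) {
  const nodes = document.querySelectorAll(selector);
  // Convert the NodeList into a plain array of trimmed strings
  return Array.from(nodes, (node) => node.textContent.trim());
}

// Node side: `page` is assumed to be the Puppeteer page created in scrape().
async function logAllText(page, selector) {
  const texts = await page.evaluate(collectText, selector);
  // Print one matched element per line in the terminal
  texts.forEach((text) => console.log(text));
  return texts;
}
```

Inside scrape(), this could be called as await logAllText(page, 'p') to print the text of each paragraph on the page.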
Conclusion
In this series you have learned the basics of how to use Puppeteer. Puppeteer is full of useful tools, shortcuts, and smart ways of doing things, and it is well worth learning more about if you plan to use it professionally.
However, with the basic skills you have learned so far, you can write amazing things yourself! With the important page.evaluate
you can now do anything you can do with JavaScript in a browser. Go forth and automate your browser!
Next Steps
You now have the skills to automate the browser for almost any purpose: go create a scraping project with Puppeteer! If you would like inspiration, perhaps check out the tutorial on Making Stakeholders Happy With Daily Screenshots.
In the last part of this series, we will make some improvements to our setup by handling errors.