Google designed Puppeteer to provide a simple yet powerful interface in Node.js for automating tests and various tasks using the Chromium browser engine. It runs headless by default, but it can be configured to run full Chrome or Chromium.
The API built by the Puppeteer team uses the DevTools Protocol to take control of a web browser, such as Chrome, and perform tasks like:
- Snap screenshots and generate PDFs of pages
- Automate form submission
- UI testing (clicking buttons, keyboard input, etc.)
- Scrape a SPA and generate pre-rendered content (Server-Side Rendering)
Most actions that you can do manually in the browser can also be done using Puppeteer. Furthermore, they can be automated so you can save more time and focus on other matters.
Puppeteer was also built to be developer-friendly. People familiar with other popular testing frameworks, such as Mocha, will feel right at home with Puppeteer and find an active community offering support for it. This has led to massive growth in popularity among developers.
Of course, Puppeteer isn’t suitable only for testing. After all, if it can do anything a standard browser can do, then it can be extremely useful for web scrapers. Namely, it can execute JavaScript code so that the scraper can reach the page’s fully rendered HTML, and it can imitate normal user behavior by scrolling through the page or clicking on random sections.
These much-needed capabilities make headless browsers a core component of any commercial data extraction tool and of all but the simplest homemade web scrapers.
First and foremost, make sure you have up-to-date versions of Node.js and Puppeteer installed on your machine. If that isn’t the case, you can follow the steps below to install all prerequisites.
You can download and install Node.js from the official Node.js website. Node’s default package manager, npm, comes preinstalled with Node.js.
To install the Puppeteer library, you can run the following command in your project root directory:
```shell
npm install puppeteer # or "yarn add puppeteer"
```
Note that when you install Puppeteer, it also downloads a version of Chromium that is guaranteed to work with the API.
Keep in mind that Puppeteer is a promise-based library (it performs asynchronous calls to the headless Chrome instance under the hood). So let’s keep the code clean by using async/await.
First, create a new file called index.js in your project root directory.
Inside that file, we need to define an asynchronous function that wraps all the Puppeteer code.
```javascript
const puppeteer = require('puppeteer')

async function snapScreenshot() {
  try {
    const URL = 'https://old.reddit.com/'
    const browser = await puppeteer.launch()
    const page = await browser.newPage()
    await page.goto(URL)
    await page.screenshot({ path: 'screenshot.png' })
    await browser.close()
  } catch (error) {
    console.error(error)
  }
}

snapScreenshot()
```
First, an instance of the browser is started using the puppeteer.launch() command. Then, we create a new page using the browser instance. For navigating to the desired website, we can use the goto() method, passing the URL as a parameter. To snap a screenshot, we’ll use the screenshot() method. We also need to pass the location where the image will be saved.
Note that Puppeteer sets an initial page size to 800×600px, which defines the screenshot size. You can customize the page size using the setViewport() method.
Don’t forget to close the browser instance. Then all you have to do is run node index.js in the terminal.
It really is that simple! You should now see a new file called screenshot.png in your project folder.
Happy Coding …