In my recent project, I was asked to implement several features that rely on Puppeteer, mostly using it to open web pages and extract content.
The first feature is generating screenshots of given website URLs: a service that takes a URL and returns a Base64-encoded image.
The second one is similar, but instead of taking a screenshot of an arbitrary URL, it has to generate a thumbnail of a product detail page. To achieve that, we created a dedicated thumbnail web page for the detail page, into which we inject the essential data in a secure way, and then take a screenshot of that page.
The last one is typical crawler work: we want to fetch data from a specific page. Since the target page is rendered by JavaScript, the only feasible approach is to use Puppeteer to open the page and inject JavaScript to get data from the rendered page, as sketched below.
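For the crawler case, here is a rough sketch of that evaluate-in-page approach. The selectors and field names below are hypothetical, and the browser launch is simplified; in the actual service the browser is launched via chrome-aws-lambda, as shown later.

```js
// Sketch only: URL, selectors and field names below are hypothetical.
const puppeteer = require('puppeteer');

async function crawlProducts(url) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // Wait until the network is idle so the JavaScript-rendered content is in place.
  await page.goto(url, { waitUntil: 'networkidle0' });
  // Run a script inside the page context to read data off the rendered DOM.
  const items = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.product-item')).map((el) => {
      const title = el.querySelector('.title');
      const price = el.querySelector('.price');
      return {
        title: title ? title.textContent.trim() : null,
        price: price ? price.textContent.trim() : null,
      };
    })
  );
  await browser.close();
  return items;
}
```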
Requirements
To achieve the features above:
- We need a service that is scalable, as running a headless browser is resource-intensive and high concurrency is possible.
- We also need the tasks to run relatively isolated from each other, as multiple headless browser instances running together can be unstable.
- Lastly, the service should be independent of the main application, so that a failure of the service will not crash the application.
Solution
To meet all these requirements, we ended up setting up the Puppeteer service as a Lambda function using Serverless. The Lambda is in the same VPC as the application, so we can use AWS's API to invoke it directly.
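As a rough sketch of what invoking the Lambda "directly" can look like from the main application: the deployed function name typically follows the `<service>-<stage>-<function>` convention, and the payload keys here match the screenshot task shown later, so treat the details as assumptions.

```js
// Sketch: invoking the screenshot Lambda from the main application with the AWS SDK (v2).
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda({ region: 'us-east-1' });

async function takeScreenshot(url) {
  const result = await lambda
    .invoke({
      FunctionName: 'headless-browser-service-dev-screenshotTask', // assumed deployed name
      InvocationType: 'RequestResponse',
      Payload: JSON.stringify({ page_url: url }),
    })
    .promise();
  // The handler returns { screenshot: '<Base64 string>' }.
  const { screenshot } = JSON.parse(result.Payload);
  return screenshot;
}
```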
Serverless Config
If you haven't heard of Serverless before, check out its [website](https://www.serverless.com/).
Put simply, for me Serverless is mainly about two things:
- Infrastructure as Code: instead of manually creating cloud resources, you describe all the resources and their relationships in a configuration file. This is mainly built on cloud platform services like CloudFormation.
- Serverless Compute: services like Lambda allow you to run your code in a sandbox with limited execution time and resources, without worrying about the server itself. While Lambda instances are isolated from each other, you can launch as many instances as you want.
To get a quick overview of how we use Serverless in this case, let's take a look at the config file `serverless.yml` (for the full YAML reference for AWS, see here):
```yaml
service: headless-browser-service

provider:
  name: aws
  runtime: nodejs12.x
  memorySize: 3008
  timeout: 600
  region: us-east-1

plugins:
  - serverless-offline

package:
  exclude:
    - node_modules/puppeteer/.local-chromium/**

functions:
  screenshotTask:
    handler: tasks/screenshot.createPageScreenshot
    memorySize: 3008
    timeout: 600
```
Let me break it down bit by bit.
```yaml
service: headless-browser-service
```
This will be the unique name of the service, used as the CloudFormation stack name once deployed to AWS. Serverless then creates or updates the resources based on this YAML configuration.
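Deploying the stack is then a single command, for example:

```bash
# Package and deploy the resources described in serverless.yml (the stage name is up to you)
serverless deploy --stage dev
```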
The `provider` section is about which cloud platform the service runs on and the default resource configuration for Lambda or its equivalents.
The `functions` part:
```yaml
functions:
  screenshotTask:
    handler: tasks/screenshot.createPageScreenshot
    memorySize: 3008
    timeout: 600
```
This is where your Lambda function is actually mapped to your code. In this case, the code that will run sits in the file `tasks/screenshot.js`, which exports a function called `createPageScreenshot`.
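In other words, Serverless resolves the handler string to the file path (minus the extension) plus the exported function name. A bare skeleton of what it expects to find there (the full implementation is shown later):

```js
// tasks/screenshot.js — skeleton only; the real implementation appears below.
module.exports.createPageScreenshot = async (event) => {
  // `event` is the JSON payload the function was invoked with.
  return { screenshot: '...' };
};
```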
Invoke Lambda Locally
Notice we added a plugin for serverless:
```yaml
plugins:
  - serverless-offline
```
Serverless offline emulates AWS Lambda and API Gateway on your local machine to speed up your development cycles.
We can test out our task with the command below:
```bash
serverless invoke local --function screenshotTask --data '{"page_url":"https://google.com"}'
```
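For larger payloads, it can be handier to keep the input in a JSON file and pass it with `--path` instead of `--data`, for example (`event.json` is a hypothetical file name):

```bash
serverless invoke local --function screenshotTask --path event.json
```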
Puppeteer in Lambda
Using Puppeteer directly in Lambda will fail, as the Chromium that ships with Puppeteer needs shared libraries that the Lambda environment does not provide. So instead of using Puppeteer directly, we use chrome-aws-lambda, an npm module that includes an appropriate Chromium binary that can run in a Lambda function.
You will also need to install `puppeteer-core`, as chrome-aws-lambda uses this library to control an existing Chromium binary.
Once that is solved, we still need a way to run Puppeteer locally: the precompiled Chromium binary from chrome-aws-lambda is built specifically for AWS Lambda and might not be compatible with your local machine. The workaround is to install `puppeteer` as a development dependency, so that local runs can use the Chromium it bundles. See here for details.
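One possible way to wire that up (a sketch of a local setup, not something chrome-aws-lambda does automatically in this version) is to branch on an environment flag such as `IS_LOCAL`, which `serverless invoke local` sets, and fall back to the Chromium bundled with the dev-dependency `puppeteer`:

```js
// Sketch: pick the right Puppeteer/Chromium depending on where the code runs.
const chromium = require('chrome-aws-lambda');

async function launchBrowser() {
  if (process.env.IS_LOCAL) {
    // Local development: use the Chromium bundled with the `puppeteer` dev dependency.
    const puppeteer = require('puppeteer');
    return puppeteer.launch({ headless: true });
  }
  // In Lambda: use the precompiled Chromium shipped by chrome-aws-lambda.
  return chromium.puppeteer.launch({
    args: chromium.args,
    defaultViewport: chromium.defaultViewport,
    executablePath: await chromium.executablePath,
    headless: chromium.headless,
  });
}
```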
Overall, apart from the serverless config file, our Node.js `package.json` looks like below:
```json
{
  ...
  "dependencies": {
    "chrome-aws-lambda": "2.0.x",
    "puppeteer-core": "2.0.x",
    "serverless": "^1.57.0",
    "serverless-offline": "^3.31.3"
  },
  "devDependencies": {
    "puppeteer": "2.0.x"
  }
}
```
Notice I keep the versions of `chrome-aws-lambda`, `puppeteer-core` and `puppeteer` the same, to make sure production uses the same version as the one used in local development. Each `chrome-aws-lambda` version corresponds to an equivalent Puppeteer version; you can find the version mapping here.
Puppeteer Lambda Task
After the hard work of configuration, the task itself is relatively easy. Below is an example in `tasks/screenshot.js`:
```js
const chromium = require('chrome-aws-lambda');

// Launch the bundled Chromium with the options chrome-aws-lambda recommends for Lambda.
async function getChrome() {
  console.log('execute path', await chromium.executablePath);
  return chromium.puppeteer.launch({
    args: chromium.args,
    defaultViewport: chromium.defaultViewport,
    executablePath: await chromium.executablePath,
    ignoreHTTPSErrors: true,
    headless: chromium.headless,
  });
}

module.exports.createPageScreenshot = async (event) => {
  const pageURL = event['page_url'];
  const pageWidth = event['page_width'] || 1024;
  const pageHeight = event['page_height'] || 768;
  if (!pageURL) {
    throw new Error('page url not found');
  }

  const browser = await getChrome();
  const version = await browser.version();
  const page = await browser.newPage();

  await page.setCacheEnabled(false);
  await page.setViewport({
    width: pageWidth,
    height: pageHeight,
  });

  // Wait until the network is idle so JavaScript-rendered content is in place.
  await page.goto(pageURL, { timeout: 0, waitUntil: 'networkidle0' });

  const screenBase64 = await page.screenshot({
    type: 'jpeg',
    fullPage: false,
    encoding: 'base64',
  });

  // Always try to release the browser; a failure here should not fail the task.
  try {
    await page.close();
    await browser.close();
  } catch (e) {
    console.error('closing page or browser error', e);
  }

  return {
    screenshot: screenBase64,
  };
};
```
You will be able to use the main Puppeteer API as usual; the only difference is that you use chrome-aws-lambda's API to launch the browser.