In my recent project, I was asked to implement several features that rely on Puppeteer, mostly using it to open web pages and extract content.
The first feature is generating screenshots of given website URLs: a service that takes a URL and returns a Base64-encoded image.
The second one is similar, but instead of taking a screenshot of an arbitrary URL, it has to generate a thumbnail of a product detail page. To achieve that, we created a dedicated thumbnail web page for the detail page, into which we inject the essential data in a secure way, and then take a screenshot of that page.
The last one is typical crawler work: we want to fetch data from a specific page. Since the target page is rendered by JavaScript, the only feasible approach is to use Puppeteer to open the page and inject JavaScript to get data from the rendered page, as sketched below.
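For the crawler case, here is a rough sketch of that evaluate-in-page approach. The selectors and field names below are hypothetical, and the browser launch is simplified; in the actual service the browser is launched via chrome-aws-lambda, as shown later.

```js
// Sketch only: URL, selectors and field names below are hypothetical.
const puppeteer = require('puppeteer');

async function crawlProducts(url) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // Wait until the network is idle so the JavaScript-rendered content is in place.
  await page.goto(url, { waitUntil: 'networkidle0' });
  // Run a script inside the page context to read data off the rendered DOM.
  const items = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.product-item')).map((el) => {
      const title = el.querySelector('.title');
      const price = el.querySelector('.price');
      return {
        title: title ? title.textContent.trim() : null,
        price: price ? price.textContent.trim() : null,
      };
    })
  );
  await browser.close();
  return items;
}
```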
Requirements
To achieve the features above:
- We need a service that is scalable, as running a headless browser is resource-intensive and high concurrency is possible.
- We also need the tasks to run relatively isolated from each other, as multiple headless browser instances running together can be unstable.
- Lastly, the service should be independent of the main application, so that a failure of the service will not crash the application.
Solution
To meet all these requirements, we ended up setting up the Puppeteer service as a Lambda function using Serverless. The Lambda is in the same VPC as the application, so we can use AWS's API to invoke it directly.
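As a rough sketch of what invoking the Lambda "directly" can look like from the main application: the deployed function name typically follows the `<service>-<stage>-<function>` convention, and the payload keys here match the screenshot task shown later, so treat the details as assumptions.

```js
// Sketch: invoking the screenshot Lambda from the main application with the AWS SDK (v2).
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda({ region: 'us-east-1' });

async function takeScreenshot(url) {
  const result = await lambda
    .invoke({
      FunctionName: 'headless-browser-service-dev-screenshotTask', // assumed deployed name
      InvocationType: 'RequestResponse',
      Payload: JSON.stringify({ page_url: url }),
    })
    .promise();
  // The handler returns { screenshot: '<Base64 string>' }.
  const { screenshot } = JSON.parse(result.Payload);
  return screenshot;
}
```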
Serverless Config
If you haven't heard of Serverless before, check out its [website](https://www.serverless.com/).
Put simply, for me Serverless is mainly about two things:
- Infrastructure as Code: instead of manually creating cloud resources, you describe all the resources and their relationships in a configuration file. This is mainly built on cloud platform services like CloudFormation.
- Serverless Compute: services like Lambda allow you to run your code in a sandbox with limited execution time and resources, without worrying about the server itself. While Lambda instances are isolated from each other, you can launch as many instances as you want.
To get a quick overview of how we use Serverless in this case, let's take a look at the config file `serverless.yml` (for the full YAML reference for AWS, see here):
```yaml
service: headless-browser-service

provider:
  name: aws
  runtime: nodejs12.x
  memorySize: 3008
  timeout: 600
  region: us-east-1

plugins:
  - serverless-offline

package:
  exclude:
    - node_modules/puppeteer/.local-chromium/**

functions:
  screenshotTask:
    handler: tasks/screenshot.createPageScreenshot
    memorySize: 3008
    timeout: 600
```
Let me break it down bit by bit.
```yaml
service: headless-browser-service
```
This will be the unique name of the service, used as the CloudFormation stack name once deployed to AWS. Serverless then creates or updates the resources based on this YAML configuration.
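Deploying the stack is then a single command, for example:

```bash
# Package and deploy the resources described in serverless.yml (the stage name is up to you)
serverless deploy --stage dev
```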
The `provider` section is about which cloud platform the service runs on and the default resource configuration for Lambda or its equivalents.
The `functions` part:
```yaml
functions:
  screenshotTask:
    handler: tasks/screenshot.createPageScreenshot
    memorySize: 3008
    timeout: 600
```
This is where your Lambda function is actually mapped to your code. In this case, the code that will run sits in the file `tasks/screenshot.js`, which exports a function called `createPageScreenshot`.
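In other words, Serverless resolves the handler string to the file path (minus the extension) plus the exported function name. A bare skeleton of what it expects to find there (the full implementation is shown later):

```js
// tasks/screenshot.js — skeleton only; the real implementation appears below.
module.exports.createPageScreenshot = async (event) => {
  // `event` is the JSON payload the function was invoked with.
  return { screenshot: '...' };
};
```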
Invoke Lambda Locally
Notice we added a plugin for serverless:
```yaml
plugins:
  - serverless-offline
```
Serverless offline emulates AWS Lambda and API Gateway on your local machine to speed up your development cycles.
We can test out our task with the command below:
```bash
serverless invoke local --function screenshotTask --data '{"page_url":"https://google.com"}'
```
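For larger payloads, it can be handier to keep the input in a JSON file and pass it with `--path` instead of `--data`, for example (`event.json` is a hypothetical file name):

```bash
serverless invoke local --function screenshotTask --path event.json
```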
Puppeteer in Lambda
Using Puppeteer directly in Lambda will fail, as the Chromium that ships with Puppeteer needs shared libraries that the Lambda environment does not provide. So instead of using Puppeteer directly, we use chrome-aws-lambda, an npm module that includes an appropriate Chromium binary that can run in a Lambda function.
You will also need to install `puppeteer-core`, as chrome-aws-lambda uses this library to control an existing Chromium binary.
Once that is solved, we still need a way to run Puppeteer locally: the precompiled Chromium binary from chrome-aws-lambda is built specifically for AWS Lambda and might not be compatible with your local machine. The workaround is to install `puppeteer` as a development dependency, so that local runs can use the Chromium it bundles. See here for details.
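One possible way to wire that up (a sketch of a local setup, not something chrome-aws-lambda does automatically in this version) is to branch on an environment flag such as `IS_LOCAL`, which `serverless invoke local` sets, and fall back to the Chromium bundled with the dev-dependency `puppeteer`:

```js
// Sketch: pick the right Puppeteer/Chromium depending on where the code runs.
const chromium = require('chrome-aws-lambda');

async function launchBrowser() {
  if (process.env.IS_LOCAL) {
    // Local development: use the Chromium bundled with the `puppeteer` dev dependency.
    const puppeteer = require('puppeteer');
    return puppeteer.launch({ headless: true });
  }
  // In Lambda: use the precompiled Chromium shipped by chrome-aws-lambda.
  return chromium.puppeteer.launch({
    args: chromium.args,
    defaultViewport: chromium.defaultViewport,
    executablePath: await chromium.executablePath,
    headless: chromium.headless,
  });
}
```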
Overall, apart from the serverless config file, our Node.js `package.json` looks like below:
```json
{
  ...
  "dependencies": {
    "chrome-aws-lambda": "2.0.x",
    "puppeteer-core": "2.0.x",
    "serverless": "^1.57.0",
    "serverless-offline": "^3.31.3"
  },
  "devDependencies": {
    "puppeteer": "2.0.x"
  }
}
```
Notice I keep the versions of `chrome-aws-lambda`, `puppeteer-core` and `puppeteer` the same, to make sure production uses the same version as the one used in local development. Each `chrome-aws-lambda` version corresponds to an equivalent Puppeteer version; you can find the version mapping here.
Puppeteer Lambda Task
After the hard work of configuration, the task itself is relatively easy. Below is an example in `tasks/screenshot.js`:
```js
const chromium = require('chrome-aws-lambda');

// Launch the bundled Chromium with the options chrome-aws-lambda recommends for Lambda.
async function getChrome() {
  console.log('execute path', await chromium.executablePath);
  return chromium.puppeteer.launch({
    args: chromium.args,
    defaultViewport: chromium.defaultViewport,
    executablePath: await chromium.executablePath,
    ignoreHTTPSErrors: true,
    headless: chromium.headless,
  });
}

module.exports.createPageScreenshot = async (event) => {
  const pageURL = event['page_url'];
  const pageWidth = event['page_width'] || 1024;
  const pageHeight = event['page_height'] || 768;
  if (!pageURL) {
    throw new Error('page url not found');
  }

  const browser = await getChrome();
  const version = await browser.version();
  const page = await browser.newPage();

  await page.setCacheEnabled(false);
  await page.setViewport({
    width: pageWidth,
    height: pageHeight,
  });

  // Wait until the network is idle so JavaScript-rendered content is in place.
  await page.goto(pageURL, { timeout: 0, waitUntil: 'networkidle0' });

  const screenBase64 = await page.screenshot({
    type: 'jpeg',
    fullPage: false,
    encoding: 'base64',
  });

  // Always try to release the browser; a failure here should not fail the task.
  try {
    await page.close();
    await browser.close();
  } catch (e) {
    console.error('closing page or browser error', e);
  }

  return {
    screenshot: screenBase64,
  };
};
```
You will be able to use the main Puppeteer API as usual; the only difference is that you use chrome-aws-lambda's API to launch the browser.