Letterhead

While writing various tools that generate invoices, delivery notes and other transactional documents for Weldmet, we constantly needed to create PDFs from webpages. That experience was painful enough as a developer that it led me and my co-founder to build Letterhead, an easier way to programmatically generate PDFs. Here are some of the things I learned about designing for developers, and about building a robust and scalable API.

The Problem

Alright, so you're a developer with a freshly brewed cup of coffee and a keyboard ready to catch fire. You've picked up the task of writing a feature that will create a PDF invoice/quote/receipt/ticket from your database and send it to the customer by email. You know that the most common way to generate PDFs is to load an HTML webpage into a headless Chromium instance, and then render an approximation of the content to PDF, using a library like Puppeteer to control your Chromium instance. So, you cook up a simple self-hosted Puppeteer worker (or use a managed solution like Browserless) that visits the template on your Ruby on Rails app, and creates your PDF. Your template has a few bits of conditional logic to check if fields are present or not, an iteration for line items and a blocking request for a custom QR code image, but nothing too complicated. Since your webpage loads fine, and it looks fine in a handful of actual PDF renders on development and staging, you ship the whole thing to production and a few days pass…
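For reference, the simplest version of that Puppeteer worker is only a handful of lines. A rough sketch, with illustrative options:

import puppeteer from 'puppeteer';

// A minimal self-hosted worker: load the invoice template and print it to PDF.
// The wait strategy and PDF options here are illustrative.
async function renderInvoice(url: string): Promise<Uint8Array> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait for network activity to settle so things like the QR code image have loaded
    await page.goto(url, { waitUntil: 'networkidle0' });
    return await page.pdf({ format: 'A4', printBackground: true });
  } finally {
    await browser.close();
  }
}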

Uh oh! You've just been pinged by a team member: some customers are confused because they've received completely blank invoices. Some invoices show blocks of ☐☐☐☐ instead of text. Some are missing their QR codes. Some invoices got scrambled because you tried to create a footer, and its position wildly offset the rest of the document when someone's address took 4 lines, not the 3 you tested with. Other invoices have had their page break inserted in the middle of your boss's signature. The list of issues goes on.

What went wrong?

So, what went wrong? When developing a PDF template, you need to see how different input data shapes and network scenarios will affect the resulting PDF capture, not just the webpage. The webpage is better treated as an artifact that's useful for debugging. You don't want to rely only on how the webpage appears because:

  • CSS properties (shadows, filters, backgrounds) and layouts (e.g., flexboxes, sticky elements, absolutely positioned elements) may look totally different in your PDF.
  • You need to see how page breaks will align with the flow of your document, and you can't test this reliably in the browser.
  • Fonts you have on your development machine may not be on the machine rendering your PDF.

So it went wrong because these situations weren't tested. But before you blame yourself: you didn't really have the capacity to test them when, for every change, you had to:

  1. Set up an ngrok tunnel to your local app
  2. Tweak the parameters on your template
  3. Invoke your PDF service manually
  4. Download and inspect the resulting PDF

There was also no way to anticipate some conditions that only strike when you test at volume, or on your production server (like sporadically hanging third-party API requests or slow DB queries). Lastly, your attempt to get headers and footers to play nicely in Puppeteer forced you into a hack that broke your layout.

Having wrestled with all of these problems ourselves, we felt there needed to be a smooth, quick and simple way to create PDFs with code and we couldn't find one. So, we decided to create one.

The Solution

For this solution to be useful, it had to meet two success criteria:

  • Save Developers Time and Reduce Complexity. This was the main reason for building the tool in the first place. The status quo sucked, forcing complexity on you just to achieve basic things.
  • Not Break Production (Or The Bank) Under Stress. We wanted to help lots of other developers with the same problem, which meant building an API that could reliably generate thousands of PDFs per minute, with serverless billing that would elastically scale up and down.

Solving for Saving Developers Time

Firstly, I needed a faster way to see the results of my changes to the template code. At the most basic level, I wanted an HTML editor on one side, and a preview of the PDF generated by the API on the other. That way, I could tweak code, hit ⌘ Enter and see the result, with the only delay being the render time. The first iteration of the tool did only that, but it immediately felt like a step in the right direction:

An early version of the UI

The next thing slowing me down was bringing over the template code. For each change, I'd have to copy the resulting HTML from the tab where my local web server was running into our new editor. This was still a lot of context switching, and it also meant any changes I made in the new editor had to be copied back into our web app. I wanted to work in one place, which meant doing the HTML template rendering in the tool too. I chose Liquid as the templating language. Although Liquid was invented at Shopify for customizing storefronts, it's a simple, extensible and familiar language, making it ideal for static templating. To easily supply input variables to the Liquid engine, I added a built-in JSON editor tab, so I could preview changes to code or input variables with the same velocity. The result is a mini-IDE that lets you iterate and test quickly:
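Under the hood, rendering a Liquid template against a set of input variables is simple. Here's a minimal sketch using the liquidjs package (Letterhead's actual engine, filters and extensions aren't shown; the template and data are illustrative):

import { Liquid } from 'liquidjs';

const engine = new Liquid();

// Illustrative template and variables only
const template = `
  <h1>Receipt for {{ name }}</h1>
  {% if orderNumber %}<p>Order #{{ orderNumber }}</p>{% endif %}
  <ul>
    {% for item in lineItems %}<li>{{ item.title }}: {{ item.price }}</li>{% endfor %}
  </ul>`;

const html = await engine.parseAndRender(template, {
  name: 'Oliver Brewster',
  orderNumber: '14034056',
  lineItems: [{ title: 'Widget', price: '$32.00' }],
});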

I can then simply fire a request at api.letterhead.dev/v1/pdf. The request body is just the JSON I've been testing in the editor all along:

{
  // Specify your template ID and version
  "template": "customer-receipt@latest",
  // Supply the variables for use in the template
  "variables": {
    "name": "Oliver Brewster",
    "orderNumber": "14034056",
    "lineItems": [ ... ],
    "shippingAddress": [ ... ],
    "billingAddress": [ ... ],
    "itemsSubtotal": "$32.00",
    "shippingSubtotal": "$5.00",
    "total": "$37.00"
  }
}
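From application code, that's a single POST. A rough sketch in Node (the Authorization header name and key format are assumptions, not documented behaviour):

// Hypothetical client call: the endpoint is from above, the auth header is an assumption
const requestBody = {
  template: 'customer-receipt@latest',
  variables: { /* as in the JSON above */ },
};

const response = await fetch('https://api.letterhead.dev/v1/pdf', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer <your-api-key>',
  },
  body: JSON.stringify(requestBody),
});

const pdfBytes = await response.arrayBuffer();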

…and voilà:

How the PDF turns out

Solving for Simplicity: Headers and Footers

TL;DR: Headers and footers in Puppeteer are very limited.

Puppeteer has functionality to vertically sandwich your primary web browser viewport (the content) between two more viewports that act as your header and footer. You can then supply separate HTML snippets, completely isolated from your primary document, to be loaded and captured at the same time as your content. Each page of your PDF then comprises your header (if present), the content and the footer (if present). But since all headers (and all footers, respectively) must share the same HTML, there's no way to modify or hide them for certain pages: either every page shows the same headers/footers, or no page shows them at all. To overcome this, you are forced to put the footer in the document itself and hard-code vertical spacing between your content and your footer, which is both difficult to achieve and prohibitively error-prone.
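For reference, Puppeteer's built-in mechanism looks roughly like this (the templates and margins are illustrative):

// `page` is a Puppeteer Page that has already loaded the content document.
// The same headerTemplate/footerTemplate HTML is applied to every page.
const pdf = await page.pdf({
  format: 'A4',
  displayHeaderFooter: true,
  headerTemplate: '<div style="font-size:10px; width:100%; text-align:center;">Invoice</div>',
  footerTemplate:
    '<div style="font-size:10px; width:100%; text-align:center;">Page <span class="pageNumber"></span> of <span class="totalPages"></span></div>',
  // Leave room for the header and footer viewports
  margin: { top: '80px', bottom: '80px' },
});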

To solve this in Letterhead, we took an entirely different approach. Instead of using Puppeteer's native functionality, we wrote two more Puppeteer scripts to accompany the primary content request. Before capturing their PDFs, these two additional workers manipulate the DOM to produce documents with the same page count as the primary content document, but containing only the selected header/footer elements for each page.
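Conceptually, the header worker does something like the following before capture. This is only a sketch of the idea, not Letterhead's actual implementation, and the selector is illustrative:

// `page` is the Puppeteer Page for the header worker.
// Hide everything visually while preserving layout (so the page count matches
// the content document), then reveal only the selected header element.
await page.evaluate((headerSelector) => {
  document.body.style.visibility = 'hidden';
  const header = document.querySelector<HTMLElement>(headerSelector);
  if (header) {
    header.style.visibility = 'visible';
  }
}, '#primaryHeader');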

In the request you make to the API, your headers/footers can be specified like this:

{
  "headers": [
    // Show the primary header on pages 1, 2 and 3 only
    { "selector": "#primaryHeader", "pages": [1, 2, 3] },
    // Show the secondary header on page 4 onwards
    { "selector": "#secondaryHeader", "pages": "4..." }
  ],
  "footers": [
    // Show the footer on all even-numbered pages, except the last
    { "selector": "#footer", "pages": "even", "excludePages": [-1] }
  ]
}

The API then merges the three documents together, and ensures that the content layer is sufficiently padded at the top and bottom to accommodate the header and footer layers:

In this way, you can make headers and footers do exactly what you need. Better still, because the two extra workers run in parallel with the original 'content' worker, the additional work doesn't add noticeable latency.

Solving for Scale

The last major technical hurdle, and definitely the most challenging, was making the API scale. Although we had built Letterhead to scratch our own itch, we were keen to help other developers overcome the same problem. Before we could open up the API to a wider group of users, whose appetite for documents might be in the tens or hundreds of thousands, I first needed to ensure it was capable of handling the extra load. It wasn't yet. There are usually multiple levers to pull when it comes to making API services scale. For Letterhead, this included rewriting the core PDF merge operations in Rust (interfacing via NAPI-RS), which I'll talk about in a follow-up piece. This piece tracks the infrastructure journey, which went from AWS to Azure to GCP, as I worked through a series of bottlenecks:

AWS

Until this point, we had been using Serverless Framework (on AWS Lambda) to run Puppeteer and Chromium, but we hadn't tested how our deployment would scale beyond the small number of requests I was firing at it. Using a simple Locust script (on LoadForge), we ran tests to incrementally spawn more requests with complex payloads. We quickly found the first limit of the API. After every burst of ~20 requests, Linux would throw the exception EMFILE: too many open files, which would time out the current request and trigger a cold start for the following one. To dig deeper into why this was happening, I switched on Enhanced Monitoring for the Lambda function and traced the maximum of fd_use (namespace: /aws/lambda-insights) through CloudWatch Metrics:

Tracing file descriptor leaks in CloudWatch Metrics

What you can see is the file descriptor count spiking, because Lambda reuses the same container between bursts of requests in the same region. Once the spike reached Lambda's file descriptor limit of 1024, the request timed out and the container was destroyed. We identified and fixed connection objects that were not being disposed of properly in the application code, which reduced the leak count per request, but a number of file descriptors were still being leaked on each request. We were able to isolate the issue to Lambda, rather than the application code, by observing that the leaks still occurred when initializing Puppeteer on Lambda without performing any network requests at all. To solve this in the short term, I hacked together a health-checker (based on samswen's fix) that monitored the file descriptor count and pre-empted the exception by triggering a container reset (with process.exit(1)) before it reached the critical level. Although this prevented the exception, it was a hacky half-fix, as it relied on the container being destroyed, and restarting cold, at the end of every burst of 15 requests. I was curious whether this issue would present itself on other platforms, so I put our efforts with AWS on pause and ventured over to Azure.
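Before moving on, the health check itself boiled down to something like the following sketch (the threshold value is illustrative; on Linux, the current process's open file descriptors are listed under /proc/self/fd):

import { readdirSync } from 'fs';

// Count this process's open file descriptors and recycle the container
// before we hit Lambda's limit of 1024. The threshold is illustrative.
const FD_THRESHOLD = 900;

export function checkFileDescriptors(): void {
  const openFds = readdirSync('/proc/self/fd').length;
  if (openFds > FD_THRESHOLD) {
    // Exiting forces Lambda to discard this container and cold-start a fresh one
    process.exit(1);
  }
}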

Azure

I ported the API over to Azure's serverless platform, Azure Functions, to find that the file descriptor leak was gone. The same codebase could now handle nearly 60 requests per minute (RPM) in the load tests. The issue with Azure was the velocity of autoscaling.

As with the other serverless platforms, you can specify the maximum number of concurrent requests each instance will handle. In Azure Functions, this is set in the host.json config:

"extensions": {
    "http": {
      "maxConcurrentRequests": 1,
    }
}

Having a maximum concurrency of 1 means that for every inbound request, Azure provisions one dedicated compute instance (or 'host' in Azure-speak). This safely caters for Chromium's memory-hungriness.

However, the Consumption Plan in Azure sets a hard limit on the rate at which new instances can be provisioned: 1 new instance per second. This is OK if you have high concurrency (100+), as many web APIs do, because it's a safer bet that enough time will elapse before you need that next instance. But with my ultra-low concurrency of 1, Azure needed to be constantly spinning up new instances to handle every uptick in traffic. Although instances are kept around for possible reuse in a 10-minute window after each invocation, during spikes of traffic the rate of new instances would easily exceed 1 per second and Azure would start returning 429: Too many requests.
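To put illustrative numbers on it: with a concurrency of 1, a spike of 30 new requests arriving within 5 seconds needs up to 30 instances (minus any kept warm), but Azure will only provision 5 new ones in that window, so the rest of the burst gets 429s.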

To get Azure to increase this provisioning limit to 2 new instances per second, you must upgrade to the Premium Plan, which introduces a minimum fixed cost of ~£200 per month. In fact, to build any production-ready API on Azure Functions, there seems to be no escaping the Premium Plan; for example, it's also a prerequisite for reserving warm instances (to prevent cold starts) for your functions. I expected some cost for maintaining warm instances, but I couldn't justify a high monthly subscription for a hobby project without users at that point. So we decided to keep shopping around, and see how far we could get with GCP.

GCP

Moving on to the third and last serverless offering, I began experimenting with Google Cloud Platform (GCP). Like AWS and Azure, GCP offers two primary ways of doing serverless:

  • Application Code Only. You don't have to worry about the underlying runtime or the environment, as the cloud provider manages these aspects. But you trade convenience for control.
  • Docker Container. You provide a Docker image which contains your application code along with the necessary runtime and dependencies. The cloud provider runs your function inside this container, which lets you control your exact setup.

Having used the 'application code only' approach in AWS and Azure, this time around I decided to create a custom Docker image for the API and deploy it through Google CloudRun. On reflection, Docker made much more sense for our use case and I should have been using it earlier. With a Docker container, I could now eliminate discrepancies between the development and production environments, and version-control changes to the environment (there are dozens of Chromium dependencies) along with our application code. To fill the gap left by the managed runtime, I chose Express for its simplicity, speed and composable middleware, and as an opportunity to learn something new.

Free from the autoscaling rate limit imposed by Azure, the first round of load tests on CloudRun hit ~200 RPM before a bottleneck of a different kind emerged.

Concurrency

The first bottleneck that emerged on the GCP deployment was related to concurrency.

To explain this, it helps to have a mental model of the architecture of the API. At this point, it's a single Express API that exposes two routes, which correspond to the controller and worker functions in the diagram below.

The controller here is responsible for routing the initial inbound request from the client and processing the payload. The controller then invokes up to three workers to render the actual PDFs, which it finally stitches together and returns to the client as a single document.
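To make that concrete, here's a stripped-down sketch of the controller route. The public route matches the API above, but renderWorker and mergeDocuments are hypothetical placeholders, not Letterhead's actual internals:

import express from 'express';

// Hypothetical helpers: invoke a worker service over HTTP, and merge the resulting PDFs.
declare function renderWorker(kind: 'content' | 'headers' | 'footers', payload: unknown): Promise<Uint8Array>;
declare function mergeDocuments(parts: Uint8Array[]): Promise<Uint8Array>;

const app = express();
app.use(express.json());

// Controller: parse the payload, fan out to up to three workers in parallel,
// then stitch the results together and return a single document.
app.post('/v1/pdf', async (req, res) => {
  const { template, variables, headers, footers } = req.body;

  const jobs = [
    renderWorker('content', { template, variables }),
    headers ? renderWorker('headers', { template, variables, headers }) : null,
    footers ? renderWorker('footers', { template, variables, footers }) : null,
  ].filter((job): job is Promise<Uint8Array> => job !== null);

  const layers = await Promise.all(jobs);
  const merged = await mergeDocuments(layers);
  res.contentType('application/pdf').send(Buffer.from(merged));
});

app.listen(8080);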

Ok, now back to the concurrency problem. Because the API was deployed as a single container image (and a single CloudRun service), I could only set a single concurrency value for the whole API, covering both controllers and workers. That meant even requests to the controller, which doesn't use Chromium at all, were given one dedicated instance each. This was a massive overallocation of our available compute: with each client request tying up one controller instance plus up to three worker instances, a burst of just eight or so simultaneous requests could claim the entire GCP quota of 30 instances per region. I solved this first bottleneck by splitting the controller out from the workers, as separate Docker images and deployments. Controllers are assigned a far higher concurrency and a slimmed-down image, so they scale far more conservatively than the worker pool. This change allowed the API to reach 500 RPM.

Load Balancing: Unlocking 1,000 RPM and beyond

Rerunning the load tests, we hit another spate of error codes at ~1,000 RPM. Similar to the autoscaling issue with Azure Functions, the acceleration of requests was outpacing CloudRun's ability to provision new instances in the same region, and the familiar 429 error code was back. To unblock this, I deployed the same CloudRun image to several GCP regions in Europe and set up a global external load balancer to allocate requests with a round-robin policy, allowing each region to scale up more gracefully. The load balancer helped the API reach 1,300 RPM before the 429 errors crept back in again.

These new failures arose in edge cases where the load balancer picked a region that was already overwhelmed. I wanted to implement some retry logic in front of the load balancer, and for this I turned to Cloudflare Workers. Cloudflare Workers are lightweight functions that run on the V8 engine at the edge of Cloudflare's network, giving them low latency. They are cheap and fast, making them ideal for high-volume API traffic. I wrote a simple retry handler to run in the Worker, which sits in front of the load balancer:

export async function handleRequest(request: Request): Promise<Response> {
  const originalResponse = await fetch(request)
  if (originalResponse.status === 429) {
    // Uh oh! We're scaling too fast, let's try again in 3 seconds.
    await new Promise((resolve) => setTimeout(resolve, 3000))
    // Retry the request
    return await fetch(request)
  } else {
    return originalResponse
  }
}

I deployed it with the Wrangler CLI as part of the same CI/CD workflow as the API itself.

The last handful of test failures cropped up at 1,500 RPM, when the load balancer would naively route the retry to the same region again, even though the previous request to that region had just failed because it was overloaded. To handle this rare edge case, and to gain more nuanced control over the availability of individual regions, I decided to write a primitive load balancer to run in the Cloudflare Worker itself (using KV storage to persist node availability data) and shelved the GCP Load Balancer:
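A sketch of that Worker-side load balancer is below. The KV binding name, region URLs and cooldown are illustrative, and the real implementation has more nuance:

// REGION_STATUS and the region URLs are illustrative names.
declare const REGION_STATUS: KVNamespace;

const REGIONS = [
  'https://pdf-workers-europe-west1.example.app',
  'https://pdf-workers-europe-west2.example.app',
  'https://pdf-workers-europe-north1.example.app',
];

export async function handleRequest(request: Request): Promise<Response> {
  const url = new URL(request.url);

  for (const region of REGIONS) {
    // Skip any region recently marked as overwhelmed
    if (await REGION_STATUS.get(region)) continue;

    const response = await fetch(new Request(region + url.pathname + url.search, request.clone()));
    if (response.status !== 429) {
      return response;
    }

    // Mark the region unavailable for a short cooldown, then try the next one
    await REGION_STATUS.put(region, 'overwhelmed', { expirationTtl: 60 });
  }

  return new Response('All regions are busy, please retry shortly', { status: 503 });
}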

This dealt with the issue by not re-attempting already-overwhelmed regions. We were able to let the load tests scale error-free to 3,000 RPM for 10 minutes. Conceivably, it could have been pushed further, but the cost of such load tests was becoming non-trivial! For our initial prototype, this felt like a good place to stop our quest for scale, for now.