Migrating our infrastructure to Pulumi

At Airplane, we use Google Cloud Platform (GCP) and Kubernetes to run the infrastructure behind our product. Earlier this year, we kicked off a project to migrate all of the associated configuration to Pulumi, a tool that supports creating and managing infrastructure via code.

In the remainder of this post, we describe why we chose Pulumi, how we migrated our infrastructure to this system, and what we learned in the process.

Initial state

When the company started and had just two people, new infrastructure was created manually by clicking through the GCP console, and services were deployed in Kubernetes by running kubectl apply on the command line. In addition, "production" and "staging" shared the same GCP project and Kubernetes cluster; the two were separated via naming conventions and lightweight isolation mechanisms like Kubernetes namespaces.

In retrospect, this seems pretty primitive, but it made sense at the time (and is what many companies do when starting out) since it's often overkill to do more before you have a product or any customers.

As we grew, the Kubernetes configs were migrated to kustomize and the apply steps were moved to our CI so that services would be auto-deployed as code changes were merged. This worked, but given the limitations of kustomize, it was still tedious to add new services, understand config diffs, and reference parameters from the non-Kubernetes parts of our infrastructure.

At the same time, GCP changes, e.g. to resize a database or configure a Kubernetes node pool, were still done manually in the console. Thankfully, these didn't happen too frequently, but when they did, the person making them had to be extremely careful and use hacks like UI screenshots to capture the previous state in case a revert was needed later.

As the company expanded, we realized that we needed better isolation between our staging and production environments. We also wanted to adopt an "infrastructure as code" (IaC) framework so that infrastructure could be created and updated via GitHub pull requests instead of clicking around in UIs and running ad-hoc command-line tools.

In December 2021, we kicked off a project to address both of these goals by creating a completely separate staging environment from the ground up using an IaC framework. Assuming that this was successful, we would then migrate our production infrastructure later.

Why we chose Pulumi

The first decision to make was which IaC framework to use. HashiCorp's Terraform was the obvious choice since we had used it extensively at our former employers. Terraform is widely adopted because it has a lot of great features: it's open source, has extensive documentation, and supports every major cloud provider (AWS, GCP, Azure, etc.) and lower-level infrastructure system (e.g., Kubernetes) that one could possibly want to use.

On the other hand, it definitely has some rough spots, the main one being that configuration is written in Terraform's custom HashiCorp Configuration Language (HCL). HCL is fine for simple things, but in large projects it can be challenging to modularize code and to express the equivalents of logical primitives like if statements and for loops. Moreover, Terraform's cloud-hosted version is somewhat limited; as a result, our previous employers had all run Terraform in a self-hosted mode, which requires some extra setup and custom, company-specific tooling for features like state management and secrets.

As an alternative to Terraform, we decided to try out a newer competitor in the space called Pulumi. Unlike Terraform, Pulumi allows its users to write configuration using general-purpose programming languages that our engineering team is already familiar with, like Python, TypeScript, and Go. Thus, features like IDE support, code modularization, and logical primitives are supported for free in a natural, language-native way. In addition, Pulumi's cloud-hosted product is fully featured and tightly integrated with the command-line tooling.
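As a rough illustration of what language-native logical primitives buy you, the sketch below builds a list of node pool definitions with an ordinary Go loop and if statement. The NodePool type, the pool names, and the sizing rule are all hypothetical and are not taken from our configuration; in a real Pulumi program each entry would be passed to a provider's resource constructor rather than returned as a plain struct.

```go
package main

import "fmt"

// NodePool is a hypothetical, simplified stand-in for the arguments
// that a provider's node pool resource would take.
type NodePool struct {
	Name        string
	MachineType string
	MinNodes    int
	MaxNodes    int
}

// nodePools builds one pool per workload class. The production-only
// branch is the kind of conditional that is awkward to express in HCL.
func nodePools(env string, classes []string) []NodePool {
	pools := make([]NodePool, 0, len(classes))
	for _, class := range classes {
		p := NodePool{
			Name:        fmt.Sprintf("%s-%s", env, class),
			MachineType: "e2-standard-2",
			MinNodes:    1,
			MaxNodes:    3,
		}
		if env == "production" {
			// Assumed rule for this sketch: production gets larger
			// machines and more autoscaling headroom.
			p.MachineType = "e2-standard-8"
			p.MaxNodes = 10
		}
		pools = append(pools, p)
	}
	return pools
}

func main() {
	for _, p := range nodePools("staging", []string{"api", "workers"}) {
		fmt.Printf("%+v\n", p)
	}
}
```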

Of course, these "pros" could also be interpreted as "cons". Writing configuration in Python, for instance, instead of the more limited HCL allows users to create arbitrarily complex and non-deterministic logic if they aren't careful.

In our case, however, we decided that these Pulumi-specific features could really help us make the migration quickly and manage our infrastructure in the lowest-overhead way. We decided to give Pulumi a try and fall back to Terraform if things didn't work out.

Technical decisions

Once we decided to use Pulumi, we needed to make a number of decisions around the layout of our code and the language we would write it in. Before going into these, it's helpful to define some associated, Pulumi-specific terminology.

Terminology

At the top level, Pulumi code is organized into projects. A project is a directory that contains a runnable entrypoint and some configuration.

Each project is deployed to one or more stacks. A stack is a target that the code in a project is applied to. Although this target can be anything in general, one common practice is to have a stack per environment, e.g. "development", "staging", "production", etc. Each stack has a YAML config in the project directory that specifies the parameter values to use for the associated target.

The configuration for a specific project and stack is rolled out by running pulumi up. This operation, which is equivalent to terraform apply, compares the state of the infrastructure to the desired configuration in the code and then modifies the former, as needed, to keep the two in sync.
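To make the compare-then-modify idea concrete, here is a toy model of the planning step. It is only a sketch of the general technique and not how Pulumi's engine is implemented: resources are reduced to a name plus a serialized property string, and the Action type and plan function are names we made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Action describes one step of a hypothetical update plan.
type Action struct {
	Op   string // "create", "update", or "delete"
	Name string
}

// plan compares desired against actual (resource name -> serialized
// properties) and returns the steps needed to make actual match desired.
func plan(desired, actual map[string]string) []Action {
	var actions []Action
	for name, want := range desired {
		have, exists := actual[name]
		switch {
		case !exists:
			actions = append(actions, Action{Op: "create", Name: name})
		case have != want:
			actions = append(actions, Action{Op: "update", Name: name})
		}
	}
	for name := range actual {
		if _, wanted := desired[name]; !wanted {
			actions = append(actions, Action{Op: "delete", Name: name})
		}
	}
	// Go map iteration order is random, so sort by name for a stable plan.
	sort.Slice(actions, func(i, j int) bool { return actions[i].Name < actions[j].Name })
	return actions
}

func main() {
	desired := map[string]string{"api-db": "tier=large", "api-bucket": "region=us"}
	actual := map[string]string{"api-db": "tier=small", "old-queue": "region=us"}
	for _, a := range plan(desired, actual) {
		fmt.Println(a.Op, a.Name)
	}
}
```

With the inputs in main, the plan creates api-bucket, updates api-db, and deletes old-queue; resources whose properties already match produce no action.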

Language choice

Pulumi officially supports runtimes for Node.js (JavaScript/TypeScript), Python, Java, Go, and .NET (C#, F#, etc.). Our product code is split between TypeScript and Go, so these were the two main contenders. Ultimately, we picked Go because that's what most of our backend code is written in, and thus that choice would allow for a consistent developer experience when making combined backend and infrastructure changes.

Overall, we've been happy with the Go SDK and stuck with it throughout the project. Some of Pulumi's Go interfaces feel a little verbose and/or non-canonical, and some advanced features like Dynamic Providers aren't yet supported, but these haven't really gotten in the way for us. Picking a runtime based on language comfort and organizational consistency as opposed to SDK features is probably the way to go if you're making a similar decision.

Projects

Along with choosing a language, we needed to decide how to lay out our projects.

One simple choice would be to put everything in a single, giant project. This keeps the operations simple and is the easiest to configure, but, as with large Terraform modules, has performance downsides because each run of pulumi up potentially needs to scan and modify a large number of resources.

At the other end of the spectrum, one can go very granular, creating separate projects for each service and shared infrastructure component. This is the most performant for applying small changes, but becomes tedious for changes that affect multiple services and components. Sharing state between projects or modules also takes some effort in IaC systems; although both Terraform and Pulumi have features to help with this, too much granularity can make this sharing painful.

We ultimately decided on a middle ground with three medium-sized, shared projects, each with per-environment stacks:

infra: Configuration for lower-level components like networking, node pools, and Kubernetes control-planes.
services: Higher-level, service-specific configuration including service accounts, load balancers, and Kubernetes resources.
observability: Configuration for our monitors and SLOs (with more things, like dashboards, to potentially be added in the future).

The boundaries between these were drawn to match the typical scope of Airplane developer changes, which are usually updating either service code, infrastructure, or observability configs but rarely more than one category in this set. Separating services from lower-level infrastructure was also done to make trial-and-error in the latter layer easier; if the two were coupled, then service updates, which happen frequently, would routinely overwrite applied but unmerged infra changes.

In addition to these projects, we also created a separate package called lib with shareable library code. This library has functions, for instance, to set up generic services in Kubernetes, create storage buckets, and spin up databases using our company's standard naming and security conventions.
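The sketch below shows the general shape such helpers can take: callers pass only what differs between services, and the helper fills in the conventions. Everything in it is hypothetical (the ServiceSpec type, the label keys, the "example-" bucket prefix, and the validation rule); it is not our actual lib code, and a real helper would go on to create Pulumi resources instead of returning a struct.

```go
package main

import (
	"fmt"
	"strings"
)

// ServiceSpec is a hypothetical, simplified description of a service.
type ServiceSpec struct {
	Name     string
	Env      string
	Replicas int
	Labels   map[string]string
}

// NewServiceSpec applies assumed company-wide defaults so that callers
// only supply what differs between services.
func NewServiceSpec(env, name string, replicas int) (ServiceSpec, error) {
	if name == "" || strings.ContainsAny(name, " _") {
		return ServiceSpec{}, fmt.Errorf("invalid service name %q", name)
	}
	if replicas < 1 {
		replicas = 1 // assumed default: always run at least one replica
	}
	return ServiceSpec{
		Name:     name,
		Env:      env,
		Replicas: replicas,
		Labels: map[string]string{
			"app":        name,
			"env":        env,
			"managed-by": "pulumi",
		},
	}, nil
}

// BucketName encodes an assumed naming convention: prefix, environment, purpose.
func BucketName(env, purpose string) string {
	return fmt.Sprintf("example-%s-%s", env, purpose)
}

func main() {
	spec, err := NewServiceSpec("stage", "api", 2)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", spec)
	fmt.Println(BucketName("stage", "uploads"))
}
```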

Final layout

We put the code for our projects into a single, top-level pulumi directory in our main monorepo. The following shows the high-level layout plus descriptions of the key files:

Note that each Pulumi project maps to a Go main package with a main.go entrypoint. These entrypoints, in turn, call helper functions defined in the other files in the package, which then call the helpers in the lib package as needed.

Our services entrypoint, for instance, looks like:

The createAPI function, which is defined in api.go in the same package, creates the service and its dependencies using stack-specific parameters plus our lib functions:

The config.Require function pulls in values from the stack-specific configuration file for the project. In the stage case, for instance, we have:

Although not shown in our examples for the sake of simplicity, the config file can also contain secret values that are encrypted with keys provided by either Pulumi or the user. Thus, things like DB passwords are stored in there too and checked in to our repo (in encrypted form) without having to maintain these in a separate secrets service.
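For readers who have not used stack configuration before, the following toy model shows the "required value" semantics in isolation. It is a stdlib-only sketch, not the Pulumi SDK: the StackConfig type and the keys apiReplicas and dbTier are hypothetical, and the real lookup takes the Pulumi context, so the error-returning signature here is a simplification we chose for the example.

```go
package main

import (
	"fmt"
	"strconv"
)

// StackConfig is a toy stand-in for the key/value pairs in a stack's YAML file.
type StackConfig map[string]string

// Require returns the value for key, or an error if the stack does not set it.
func (c StackConfig) Require(key string) (string, error) {
	v, ok := c[key]
	if !ok {
		return "", fmt.Errorf("missing required configuration value %q", key)
	}
	return v, nil
}

// RequireInt is like Require but also parses the value as an integer.
func (c StackConfig) RequireInt(key string) (int, error) {
	v, err := c.Require(key)
	if err != nil {
		return 0, err
	}
	n, err := strconv.Atoi(v)
	if err != nil {
		return 0, fmt.Errorf("configuration value %q is not an integer: %w", key, err)
	}
	return n, nil
}

func main() {
	// Hypothetical values for a "stage" stack.
	stage := StackConfig{"apiReplicas": "2", "dbTier": "small"}

	replicas, err := stage.RequireInt("apiReplicas")
	if err != nil {
		panic(err)
	}
	fmt.Println("replicas:", replicas)

	// A key that the stack never set is reported instead of silently defaulted.
	if _, err := stage.Require("redisSize"); err != nil {
		fmt.Println("error:", err)
	}
}
```

Failing loudly on a missing key is the useful property: a stack that was never given a value cannot quietly fall back to another environment's settings.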

The lib.CreateService* calls refer to some functions in the services.go file in the lib package. This is the layer that actually calls the Kubernetes-related functionality exposed by the Pulumi SDK:

The other code paths, i.e. to create our other infrastructure and services, follow a very similar pattern. Thus, the entrypoints are relatively simple and the implementations are able to use a lot of common code.

The migration process

Once we settled on a runtime and project layout, we came to the hardest part of the project: actually writing the code to create everything in our staging environment.

To get started, we relied heavily on Pulumi's import tool to generate code based on our existing infrastructure; in most cases, the import command to be run could be copied from the documentation for the associated resource type. It generally worked well, although the code that it created often needed some manual cleanup to support reuse across resources and stacks. For some resources, like our Kubernetes service manifests, we didn't bother importing and just started from scratch since this was cleaner and easier.

After several weeks of iteration, we successfully created staging stacks for the infra and services projects that included all of the infrastructure and services in this environment. Finally, we integrated Pulumi's GitHub Actions into our CI workflows so that code changes would get automatically applied when merged into our main branch.

The next step was to migrate production, which we started in May 2022. Fortunately, staging and production had been (manually) configured in a similar way and most resources were 1:1 across the two environments. So, rather than repeating the same import process, for most resources we just used the staging code as the base and manually modified this code until it could be applied cleanly in both environments.

In the easiest cases, the same code could be applied in both environments without any changes. In other cases, we had to parameterize certain values (e.g., the size of a DB, the number of cores to allocate to a container, etc.) and set them in the stack-specific YAML files so they could vary by environment. Finally, there were some resources for which re-using the staging code was extremely hard; for these, we just ran a separate pulumi import process for production and then used environment-specific code paths by checking the name of the stack, i.e.:
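A minimal sketch of that kind of stack-name check follows. In a Pulumi Go program the stack name is available from ctx.Stack(); here it is passed in as a plain string so the example stays self-contained. The stack name "prod", the database names, and the databaseName helper are hypothetical and only illustrate the pattern.

```go
package main

import "fmt"

// databaseName returns the name of the database to manage for a stack.
// In a real program, stack would come from ctx.Stack().
func databaseName(stack string) (string, error) {
	switch stack {
	case "stage":
		// Created from scratch by Pulumi, so it follows the standard convention.
		return "api-db", nil
	case "prod":
		// Assumed for this sketch: production was created by hand under an
		// older name and imported as-is.
		return "legacy-main-db", nil
	default:
		return "", fmt.Errorf("no database mapping for stack %q", stack)
	}
}

func main() {
	for _, stack := range []string{"stage", "prod", "dev"} {
		name, err := databaseName(stack)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", stack, name)
	}
}
```

Returning an error for unknown stacks, rather than defaulting to one environment's values, keeps a newly added stack from silently picking up production-specific behavior.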

After several more weeks of iteration, we got production up to parity with staging, and also moved our Datadog monitors and SLOs into an observability project. In July 2022, after around seven months of on-and-off effort, the migration was done!

Final thoughts

Looking back on the project, migrating to an IaC system has been a huge win: it gives us clean, consistent, and safe mechanisms for making infrastructure changes at Airplane. Doing this while we were still small (fewer than 15 people) allowed us to make the switch with minimal disruption, and will save us a huge amount of pain as we get bigger.

Evaluating our choice of Pulumi over Terraform is a bit more nuanced. The main benefits have been:

Using the same fully featured language for both backend code and infrastructure configuration (Go in our case)
Having a simple way to configure and store environment-specific parameters, including secrets
Having an easy-to-use, cloud-hosted option for state and locking that's tightly integrated with the default tooling

At the same time, though, there are definitely some things we miss from the Terraform world:

Better documentation: Pulumi's docs are OK, but they're not as well organized or helpful as the Terraform ones. In some cases, we've had to dig into the Pulumi code to really understand what certain API parameters actually mean.
More extensive provider catalog: Pulumi does have workarounds for this, e.g. using their pulumi-terraform-bridge tool, but they're not the same as having native support.
Ecosystem of third-party tools like Atlantis and Infracost.

Also, we have hit a few bugs and issues along the way. The Pulumi team has been very responsive (see this bug that we filed, for instance), but their tooling isn't as mature as Terraform's yet given that they started several years later.

On the whole, though, we feel that the pros have outweighed the cons; we've been extremely happy with Pulumi and would choose it again if we were starting from scratch.

If your organization has already invested a lot in Terraform configuration and is happy with it, it's probably not worth the effort to switch right now. But if you're starting with only light usage of another IaC framework (or none at all), Pulumi is definitely worth exploring.

Join us!

Here at Airplane, we're building infrastructure for internal tools. If this kind of work sounds interesting to you, come join us! https://www.airplane.dev/careers