From zero to staging and back

Update (November 2016): There’s now an updated and revised version of this article.


The primary use of a staging environment is to test all installation/configuration/migration scripts and procedures, before they are applied to production environment. This ensures that all major and minor upgrades to the production environment will be completed reliably without errors, in minimum time.
– Wikipedia

The first major task I took on after joining the Werkzeugschmiede team at Jimdo was building a staging environment for Wonderland, our in-house PaaS for microservices. Having a pre-production environment for testing prior to deploying to production is invaluable. It’s a safety net that makes the whole deployment process a lot less scary. We knew that it would give us more confidence to experiment, fix bugs, and implement new features. Given all these benefits, creating a staging environment for Wonderland was long overdue. Here’s how we did it.

Pair programming

From the beginning, we’ve employed pair programming. Building a staging environment that mirrors production as closely as possible, and doing this with Paul who knows Wonderland inside out, was a great way to learn about the system and its components. I was able to ask questions when something was unclear and, at the same time, contribute my own ideas whenever I felt like it. This way, we created a fast feedback loop that not only helped me find my way through Wonderland, but also learn about its creators and their modus operandi.

All in all, I highly recommend pairing for onboarding new team members – even if you don’t have the luxury of building a production-like environment from scratch.

One VPC per environment

Wonderland’s infrastructure is hosted on AWS. Rather than using a single VPC for both production and staging, we agreed to operate a dedicated VPC per environment (via separate AWS accounts). This setup is an effective way to isolate environments from one another. Most importantly, it prevents changes made in staging – whether intentionally or by mistake – from affecting production. Other advantages of having different VPCs are:

Working with multiple AWS accounts makes credential management a bit more involved, though. That’s why we’re using awsenv, a tool by my coworker Knut, to quickly switch between accounts.
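For illustration, the underlying mechanism can be sketched with standard AWS CLI named profiles – awsenv wraps this kind of switching. The profile names and key placeholders below are made up for this example:

```ini
# ~/.aws/credentials – hypothetical profile names, not our actual setup
[wonderland-production]
aws_access_key_id     = AKIA...
aws_secret_access_key = ...

[wonderland-staging]
aws_access_key_id     = AKIA...
aws_secret_access_key = ...
```

With a file like this in place, prefixing a command with `AWS_PROFILE=wonderland-staging` targets the staging account, while `AWS_PROFILE=wonderland-production` targets production – two sets of credentials, two isolated environments.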

Besides AWS, we also created separate accounts for all third-party services, like Papertrail and Datadog. If we do it, we do it right.

Automate all the things

We spent a lot of time automating the setup of our staging environment. We managed to get to the point where we were able to run make stage in our github.com/Jimdo/wonderland repository and Ansible would take care of everything, from bootstrapping our ECS cluster to provisioning Jenkins – our central state enforcer – to running essential Docker containers.

To achieve that, we took the existing Ansible playbooks and CloudFormation templates for production and adapted them for use in staging. This meant that we had to:

Even after sorting this out, there was only one way to find out whether our automation actually worked as expected: creating staging from scratch, again and again.

Destroy all the things

In addition to make stage, we also implemented the inverse operation, make destroy-stage, to completely destroy staging. This boils down to deleting all CloudFormation stacks as well as all other resources created by Ansible – in reverse order of creation. Tearing down CloudFormation stacks is usually straightforward. However, shelling out to the AWS CLI in Ansible can quickly lead to dependencies that are hard to remove. And even if Ansible does provide a specific AWS module, there’s no guarantee that the “absent” state is implemented properly (I’m looking at you, iam module).
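Deleting stacks in reverse order of creation can be sketched with Ansible’s cloudformation module – the stack names below are invented for this example:

```yaml
# Hypothetical teardown tasks; stack names are made up for this sketch.
- name: Delete CloudFormation stacks in reverse order of creation
  cloudformation:
    stack_name: "{{ item }}"
    state: absent
    region: "{{ aws_region }}"
  with_items:
    - staging-jenkins
    - staging-ecs-cluster
    - staging-vpc
```

Setting `state: absent` tells the module to delete the stack if it exists – which is exactly the kind of well-implemented “absent” state you hope for in every AWS module.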

Once make destroy-stage did what we wanted it to do, we were able to bootstrap staging from scratch. This in turn allowed us to verify that our infrastructure code produces the correct results when starting from a clean system.

To further automate things, I created one Jenkins job in prod to automatically destroy staging every Friday night and another one to rebuild it on Monday morning. This way, we’ll gain even more confidence in our code and, as a nice side effect, save a bit of money.
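In Jenkins cron syntax, the two build triggers might look like this – the exact times are examples, not our real schedule:

```text
# Destroy job: Friday around 10 pm (H spreads the exact minute)
H 22 * * 5

# Rebuild job: Monday around 6 am
H 6 * * 1
```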

Some drawbacks

While I’m quite happy with what we’ve achieved so far, there’s still room for improvement, namely:

Tagged under: Wonderland, AWS, Ansible, CloudFormation