Agile Configuration Management – part 1

On June 5 I held a lightning talk on Agile Configuration Management at the Agila Sverige 2014 conference. The 10 minute format does not allow for digging very deep. In this series of blog posts I expand on this topic.

The last year I have lead a long journey towards agile and devopsy automated configuration management for infrastructure at my client, a medium sized IT department. It’s part of a larger initiative of moving towards mature continuous delivery. We sit in a team that traditionally has had responsibility to maintain the test environments, but as part of the CD initiative we’ve been pushing to transform this to instead providing and maintaining a delivery platform for all environments.

The infrastructure part was initiated when we were to set up a new system and had a lot of machines to configure for that. Here was a golden window of opportunity to introduce modern configuration management (CM) automation tools. Note that nobody asked us to do this, it was just the only decent thing to do. Consequently, nobody told us what tools to use and how to do it.

The requirement was thus to configure the servers up to the point where our delivery pipeline implemented with Jenkins could deploy the applications, and to maintain them. The main challenge was that we need to support a large amount of java web applications with slightly different configuration requirements.

Goals

So we set out to find tools and build a framework that would support agile and devopsy CM. We’re building something PaaS-like. More specifically the goals we set up were:

  1. Self service model
    It’s important to not create a new silo. We want the developers to be able to get their work done without involving us. There is no configuration manager or other command or control function. The developers are already doing application CM, it’s just not acknowledged as CM.
  2. Infrastructure as Code
    This means that all configuration for servers are managed and versioned together as code, and the code and only the code can affect the configuration of the infrastructure. When we do this we can apply all the good practices we know well from software development such as unit testing, collaboration, diff, merge, etc.
  3. Short lead times for changes
    Short means minutes to hours rather than weeks. Who wants to wait 5 days rather than 5 minutes to see the effect of a change. Speeding up the feedback cycle is the most important factor for being able to experiment, learn and get things done.

Project phases

Our journey had different phases, each with their special context, goals and challenges.

1. Bootstrap

At the outset we address a few systems and use cases. The environments are addressed one after the other. The goal is to build up knowledge and create drafts for frameworks. We evaluate some, but not all tools. Focus is on getting something simple working. We look at Puppet and Ansible but go for the former as Ansible was very new and not yet 1.0. The support systems, such as the puppet master are still manually managed.

We use a centralized development model in this phase. There are few committers. We create a svn repository for the puppet code and the code is all managed together, although we luckily realize already now that it must be structured and modularized, inspired by Craig Dunns blog post.

2. Scaling up

We address more systems and the production environment. This leads to the framework expanding to handle more variations in use cases. There are more committers now as some phase one early adopters are starting to contribute. It’s a community development model. The code is still shared between all teams, but as outlined below each team deploy independently.

The framework is a moving target and the best way to not become legacy is to keep moving:

  • We increase automation, e.g. the puppet installations are managed with Ansible.
  • We migrate from svn to git.
  • Hiera is introduced to separate code and data for puppet.
  • Full pipelines per system are implemented in Jenkins.
    We use the Puppet dynamic environments pattern, have the Puppet agent daemon stopped and use Ansible to trigger a puppet agent run via the Jenkins job to be able to update the systems independently.

The Pipeline

As continuous delivery consultants we wanted of course to build a pipeline for the infrastructure changes we could be proud of.

Steps

  1. Static checks (Parse, Validate syntax, Compile)
  2. Apply to CI  (for all systems)
  3. Apply to TEST (for given system)
  4. Dry-run (–noop) in PROD (for given system)
  5. PROD Release notes/request generation (for given system)
  6. Apply in PROD (for given system)

First two steps are automatic and executed for all systems on each commit. Then the pipeline fork and the rest of the steps are triggered manually per system)

Complexity increases

Were doing well, but the complexity has increased. There is some coupling in that the code base is monolithic and is shared between several teams/systems. There are upsides to this. Everyone benefits from improvements and additions. We early on had to structure the code base and not have a big ball of mud that solves only one use case.

Another form of coupling is that some servers (e.g. load balancers) are shared which forces us to implement blocks in the Jenkins apply jobs so that they do not collide.

There is some unfamiliarity with the development model so there is some uncertainty on the responsibilities – who test and deploy what, when? Most developers including my team are also mainly ignorant on how to test infrastructure.

Looking at our pipeline we could tell that something is not quite all right:

Puppet Complex Pipeline in Jenkins

 

In the next part of this blog series we will see how we addressed these challenges in phase 3: Increase independence and quality.