Category Archives: Configuration Management

The power of Jenkins JobDSL

At my current client we define all of our Jenkins jobs with Jenkins Job Builder so that we can keep all of our job configurations under version control. Every push to the repository that hosts the configuration files triggers an update of the jobs in Jenkins, so the state in version control is always reflected in our Jenkins setup.

We have six jobs that are essentially the same for each of our projects, differing only slightly in their input parameters. These jobs are responsible for deploying the application binaries to different environments.

Jenkins Job Builder supports templating, but our experience is that such templates tend to become quite hard to maintain in the long run. And since it is just YAML, you are quite restricted in what you can do. JobDSL job configurations, on the other hand, are written in Groovy. That allows you to embed actual code execution in your job configuration scripts, which can be very powerful.

In our case, with JobDSL we can create one common job configuration and then simply iterate over the job-specific parameters to create the necessary jobs for each environment. We can also write utility methods in Groovy and invoke them from our job configurations. So instead of maintaining several similar Jenkins Job Builder configurations in YAML, we can express the same thing much more concisely with JobDSL.

Below is an example of a JobDSL configuration file (Groovy code) which generates six jobs from a parameterized job template:

class Utils {
    static String environment(String qualifier) { qualifier.substring(0, qualifier.indexOf('-')).toUpperCase() }
    static String environmentType(String qualifier) { qualifier.substring(qualifier.indexOf('-') + 1)}
}

[
    [qualifier: "us-staging"],
    [qualifier: "eu-staging"],
    [qualifier: "us-uat"],
    [qualifier: "eu-uat"],
    [qualifier: "us-live"],
    [qualifier: "eu-live"]
].each { Map environment ->

    job("myproject-deploy-${environment.qualifier}") {
        description "Deploy my project to ${environment.qualifier}"
        parameters {
            stringParam('GIT_SHA', null, null)
        }
        scm {
            git {
                remote {
                    url('ssh://git@my_git_repo/myproject.git')
                    credentials('jenkins-credentials')
                }
                branch('$GIT_SHA')
            }
        }
        deliveryPipelineConfiguration("Deploy to " + Utils.environmentType("${environment.qualifier}"),
                Utils.environment("${environment.qualifier}") + " environment")
        logRotator(-1, 30, -1, -1)
        steps {
            shell("""deployment/azure/deploy.sh ${environment.qualifier}""")
        }
    }
}

If you need help getting your Jenkins configuration into good shape, just contact us and we will be happy to help you! You can read more about us at diabol.se.

Tommy Tynjä
@tommysdk

Puppet resource command

I have used puppet for several years but had overlooked the puppet resource command until now. This command uses the Puppet RAL (Resource Abstraction Layer, i.e. Puppet DSL) to directly interact with the system. What that means in plain language is that you can easily reverse engineer a system and get information about it directly in puppet format on the command line.

The basic syntax is: puppet resource type [name]

Some examples from my Mac will make it more clear:

Get info about a given resource: me

$ puppet resource user marcus
user { 'marcus':
 ensure => 'present',
 comment => 'Marcus Philip',
 gid => '20',
 groups => ['_appserveradm', '_appserverusr', '_lpadmin', 'admin'],
 home => '/Users/marcus',
 password => '*',
 shell => '/bin/bash',
 uid => '501',
}

Get info about all resources of a given type: users

$ puppet resource user
user { '_amavisd':
 ensure => 'present',
 comment => 'AMaViS Daemon',
 gid => '83',
 home => '/var/virusmails',
 password => '*',
 shell => '/usr/bin/false',
 uid => '83',
}
user { '_appleevents':
 ensure => 'present',
 comment => 'AppleEvents Daemon',
 gid => '55',
 home => '/var/empty',
 password => '*',
 shell => '/usr/bin/false',
 uid => '55',
}
...
user { 'root':
 ensure => 'present',
 comment => 'System Administrator',
 gid => '0',
 groups => ['admin', 'certusers', 'daemon', 'kmem', 'operator', 'procmod', 'procview', 'staff', 'sys', 'tty', 'wheel'],
 home => '/var/root',
 password => '*',
 shell => '/bin/sh',
 uid => '0',
}

One use case for this command could be to extract the state of an existing server that I want to put under puppet management into version-controlled code.

With the --edit flag the output is sent to a buffer that can be edited and then applied. And with attribute=value you can set attributes directly from the command line.
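
For example, a couple of hypothetical invocations (the values are made up, just to illustrate the syntax; note that these commands change the system directly):

$ puppet resource user marcus shell=/bin/zsh   # set a single attribute from the command line
$ puppet resource user marcus --edit           # open the resource in an editor, apply on save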

However, I think I will only use this command for read operations. The reason is that I think the main benefit of puppet and other tools of the same ilk is not the abstraction it provides in itself, but rather the capability to treat the infrastructure as code, i.e. keep it under version control and make it testable.

The reason I have overlooked this command is probably that I've mainly been applying puppet to new, ‘greenfield’ servers, and that, as a software developer, I'm used to working from the model side rather than tinkering directly with the system on the command line. Anyway, now I have another tool in the belt. Always feels good.

Agile Configuration Management – intermezzo

Why do I need agile configuration management?

The main reason for doing agile configuration management is that it is a necessary means to achieve agile infrastructure and architecture. When you have an agile architecture it becomes easy to make the correct architectural choices.

Developers make decisions every day without realizing it. Solving a requirement by adding a little more code to the monolith and a new table in the DB is a choice, even though it may not be perceived as such. When you have an agile architecture, it's easy to create (and maintain!) a new server or middleware to address the requirement in a better way.

Agile Configuration Management – part 1

On June 5 I held a lightning talk on Agile Configuration Management at the Agila Sverige 2014 conference. The 10 minute format does not allow for digging very deep. In this series of blog posts I expand on this topic.

For the last year I have led a long journey towards agile and devopsy automated configuration management for infrastructure at my client, a medium sized IT department. It's part of a larger initiative of moving towards mature continuous delivery. We sit in a team that traditionally has had the responsibility to maintain the test environments, but as part of the CD initiative we've been pushing to transform this into providing and maintaining a delivery platform for all environments.

The infrastructure part was initiated when we were to set up a new system and had a lot of machines to configure for that. Here was a golden window of opportunity to introduce modern configuration management (CM) automation tools. Note that nobody asked us to do this, it was just the only decent thing to do. Consequently, nobody told us what tools to use and how to do it.

The requirement was thus to configure the servers up to the point where our delivery pipeline, implemented with Jenkins, could deploy the applications, and to maintain them. The main challenge was that we needed to support a large number of Java web applications with slightly different configuration requirements.

Goals

So we set out to find tools and build a framework that would support agile and devopsy CM. We’re building something PaaS-like. More specifically the goals we set up were:

  1. Self service model
    It’s important to not create a new silo. We want the developers to be able to get their work done without involving us. There is no configuration manager or other command-and-control function. The developers are already doing application CM, it’s just not acknowledged as CM.
  2. Infrastructure as Code
    This means that all configuration for servers is managed and versioned together as code, and the code, and only the code, can affect the configuration of the infrastructure. When we do this we can apply all the good practices we know well from software development, such as unit testing, collaboration, diff, merge, etc.
  3. Short lead times for changes
    Short means minutes to hours rather than weeks. Who wants to wait 5 days rather than 5 minutes to see the effect of a change? Speeding up the feedback cycle is the most important factor for being able to experiment, learn and get things done.

Project phases

Our journey had different phases, each with their special context, goals and challenges.

1. Bootstrap

At the outset we address a few systems and use cases. The environments are addressed one after the other. The goal is to build up knowledge and create drafts for frameworks. We evaluate some, but not all, tools. Focus is on getting something simple working. We look at Puppet and Ansible but go for the former, as Ansible was very new and not yet at 1.0. The support systems, such as the puppet master, are still manually managed.

We use a centralized development model in this phase. There are few committers. We create an svn repository for the puppet code and the code is all managed together, although we luckily realize already at this point that it must be structured and modularized, inspired by Craig Dunn's blog post.

2. Scaling up

We address more systems and the production environment. This leads to the framework expanding to handle more variations in use cases. There are more committers now, as some phase-one early adopters are starting to contribute. It's a community development model. The code is still shared between all teams, but as outlined below each team deploys independently.

The framework is a moving target and the best way to not become legacy is to keep moving:

  • We increase automation, e.g. the puppet installations are managed with Ansible.
  • We migrate from svn to git.
  • Hiera is introduced to separate code and data for puppet.
  • Full pipelines per system are implemented in Jenkins.
    We use the Puppet dynamic environments pattern, have the Puppet agent daemon stopped and use Ansible to trigger a puppet agent run via the Jenkins job to be able to update the systems independently.

The Pipeline

As continuous delivery consultants we wanted of course to build a pipeline for the infrastructure changes we could be proud of.

Steps

  1. Static checks (Parse, Validate syntax, Compile)
  2. Apply to CI  (for all systems)
  3. Apply to TEST (for given system)
  4. Dry-run (–noop) in PROD (for given system)
  5. PROD Release notes/request generation (for given system)
  6. Apply in PROD (for given system)

The first two steps are automatic and executed for all systems on each commit. Then the pipeline forks, and the rest of the steps are triggered manually per system.

Complexity increases

We're doing well, but the complexity has increased. There is some coupling in that the code base is monolithic and shared between several teams/systems. There are upsides to this: everyone benefits from improvements and additions. We also had to structure the code base early on, rather than having a big ball of mud that solves only one use case.

Another form of coupling is that some servers (e.g. load balancers) are shared which forces us to implement blocks in the Jenkins apply jobs so that they do not collide.

There is some unfamiliarity with the development model, so there is some uncertainty about responsibilities: who tests and deploys what, and when? Most developers, including my team, are also largely ignorant of how to test infrastructure.

Looking at our pipeline we could tell that something is not quite all right:

Puppet Complex Pipeline in Jenkins

In the next part of this blog series we will see how we addressed these challenges in phase 3: Increase independence and quality.

Gist: Ansible 1.3 Conditional Execution Examples

I just published a gist on Ansible 1.3 Conditional Execution.

It is a very complete example with comments. I find the conditional expressions to be ridiculously hard to get right in Ansible. I don’t have a good model of what’s going on under the surface (as I don’t know Python) so I often get it wrong.

What makes it even harder is that there have been at least three different variants over the course of versions 0.7 to 1.3. Now ‘when’ seems to be the recommended one, but I used to have better luck with the earlier variants.

One thing that makes it hard is that the type of the variable is very important, and it’s not obvious what that is. It seems a variable may be interpreted as a string even if it is defined as False. The framework doesn’t really help you. I think a language like this should be able to ‘do what I mean’.

Here are the official Ansible docs on this.

Puppet change promotion and code base design

I have recently introduced puppet at a medium sized development organization. I was new to puppet when I started, but feel like a seasoned and scarred veteran by now. Here’s my solution for puppet code base design and change promotion.

Like any change applied to a system, we want a defined pipeline to production that includes testing. I think the problem is not particular to modern declarative CM tools like puppet; it’s just that they make the problem a lot more explicit compared to manual CM.

Solution Summary

We have a number of environments: CI, QA, PROD, etc. We use a puppet module path containing the $environment variable to be able to update these environments independently.
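
A minimal sketch of the relevant puppet.conf section on the master (using the same /etc/puppet/environments layout as in the control script below; treat it as an illustration rather than our exact config):

$ cat /etc/puppet/puppet.conf
[master]
    modulepath = /etc/puppet/environments/$environment/modules
    manifest   = /etc/puppet/environments/$environment/manifests/site.pp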

We have built a pipeline in Jenkins that is triggered by commits to the svn repo that contains the Puppet (and Hiera) code. The initial commit stage jobs are all automatically triggered as long as the preceding step is OK, but applying to QA and PROD is manually triggered.

The steps in the commit stage are:

  1. Compile
    1. Update the code in CI environment on puppet master from svn.
    2. Use the master to parse the manifests and validate the erb templates changed in this commit.
    3. Use the master to compile all nodes in CI env.
  2. Apply to CI environment (with puppet agent --test)
  3. Apply to DEV environment
  4. Apply to Test (ST) environment

The compile sub-step is run even if the parse or validate failed, to gather as much info as possible before failing.

Jenkins puppet pipeline visualized in Diabol's new Delivery Pipeline plugin

The great thing about this is that the compile step will catch most problems with code and config before they have any chance of impacting a system.

Noteworthy is also that we have a noop run for prod before the real thing. Together with the excellent reporting facilities in Foreman, this allows me to see with high fidelity exactly what changes will be applied, line by line diff if needed, and which services will be restarted.

Triggering agent runs

The puppet agents are not daemonized. We didn’t see any important advantage in having them run as daemons, but we did see the serious disadvantage of having no simple way to prevent changes from being applied before they are tested (with parse and compile).

The agent runs are triggered using Ansible. It may seem strange to introduce another CM tool to do this, but Ansible is a really simple and powerful tool to run commands on a large set of nodes. And I like YAML.
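
To give a sense of how little ceremony that involves, an equivalent ad-hoc command against one of the inventory groups shown further down would look something like this (just a sketch; in practice we use the puppet-run.yml playbook listed below):

$ ansible xyz-ci -i hosts -f 12 -u cipuppet -m shell -a 'sudo /usr/bin/puppet agent --test --noop'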

Also, puppet run is deprecated, with the suggestion to use MCollective instead. However, that involves setting up a message queue, i.e. another middleware to manage and monitor. Every link in your tool chain has to carry its own weight (and more), and the weight of Ansible is basically zero, while for an MQ it is > 0.

We also use Ansible to install the puppet agents. Funny bootstrapping problem here: you can’t install puppet without puppet… Again, Ansible was the simplest solution for us, since we don’t manage the VMs ourselves (and either way, you have to be able to easily update the VMs, which takes a machinery of its own if it’s to be done the right way).
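
The bootstrap itself can then be as small as an ad-hoc command or a one-task playbook. A hypothetical sketch (the package name and target group here are assumptions, not our actual setup):

$ ansible xyz-ci -i hosts -u ciadmin --sudo -m yum -a 'name=puppet state=present'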

External DMZ note

Well, all developers love network security, right? Makes your life simple and safe… Anyway, I guess it’s just a fact of life to accept. Since you typically do not allow inward connections from your external DMZ, and since it’s the puppet agent that pulls, we had to set up an external puppet master in the external DMZ (with the puppet modules and yum repo rsynced from the internal side) that manages the servers in the external DMZ. A serious argument for using a push based tool like Ansible instead of puppet. But for me, puppet wins when you have a larger CM code base. Without the support of the strict checking of puppet we would be lost. But I guess I’m biased, coming from statically typed programming languages.

Code organization

We use the Foreman as an ENC, but the main use of it is to get a GUI for viewing hosts and reports. We have decided to use a puppet design pattern where the nodes are only mapped to one or a few top level role classes in Foreman, and the details are encapsulated inside the role class, using one or more layers of puppet classes. This is inspired by Craig Dunn’s Roles and Profiles pattern.

Then we use Hiera YAML files to hold most of the parameters, relying heavily on automatic parameter lookup.
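
As a minimal, hypothetical illustration (the module name and keys are made up): a class declared as class myapp ($http_port, $tomcat_version) { ... } will have its parameters resolved automatically from Hiera keys named myapp::http_port and myapp::tomcat_version when no explicit arguments are given.

$ cat /etc/puppet/hieradata/common.yaml
---
myapp::http_port: 8080
myapp::tomcat_version: '7.0.42'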

This way almost everything is in version control, which makes refactoring and releasing a lot easier.

But beware, you cannot use the future parser of puppet with Foreman as of now. This is needed for the new puppet lambda functions. This was highly annoying, as it prevents me from designing the hiera data structure in the most logical way and then just slicing it as necessary.

The create_resources function in puppet partly mitigates this, but it’s strict on the parameters, so if the data structure contains a key that doesn’t correspond to a parameter of the class, it fails.

Releasable Units

One of the questions we faced was how, and whether, to split up the puppet codebase into separately releasable components. Since we are used to trunk based development on a shared code base, we decided that it was probably easier to manage everything together.

Local testing

Unless you can quickly test your changes locally before committing, the pipeline is gonna be red most of the time. This is solved in a powerful and elegant way using Vagrant. Strongly recommended. In a few seconds I can test a minor puppet code change, and in a minute I can test the full puppet config for a node type. The box has puppet and the Vagrantfile is really short:

Vagrant.configure("2") do |config|
  config.vm.box = "CentOS-6.4-x86_64_puppet-3_2_4-1"
  config.vm.box_url = "ftp://ftptemp/CentOS-6.4-x86_64_puppet-3_2_4-1.box"

  config.vm.synced_folder "vagrant_puppet", "/home/vagrant/.puppet"
  config.vm.synced_folder "puppet", "/etc/puppet"
  config.vm.synced_folder "hieradata", "/etc/puppet/hieradata"

  config.vm.provision :puppet do |puppet|
    puppet.manifests_path = "manifests"
    puppet.manifest_file  = "site.pp"
    puppet.module_path = "modules"
  end
end

As you can see it’s easy to map in the hiera stuff that’s needed to be able to test the full solution.
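
The local test loop is then just the standard Vagrant commands:

$ vagrant up          # boot the box and run the puppet provisioner
$ vagrant provision   # re-apply puppet after a code change
$ vagrant destroy -f  # throw the box away when done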

Foot Notes

It’s been suggested in the DevOps community that you should treat servers as cattle, not pets. At the place where I implemented this, we haven’t yet reached that level of maturity. This may somewhat impact the solution, but large parts of it would be the same.

A while ago I posted Puppet change promotion – Good practices? in LinkedIn DevOps group. The solution I described here is what I came up with.

Resources

Environment based DevOps Deployment using Puppet and Mcollective
Advocates master less puppet
The NBN Puppet Journey
De-centralise and Conquer: Masterless Puppet in a Dynamic Environment

Code Examples

Control script

This script is used from several of the Jenkins jobs.

#!/bin/bash
set -e  # Exit on error

function usage {
echo "Usage: $0 -r  (-s|-p|-c|-d)
example:
$0 -pc -r 123
$0 -d -r 156
-r The svn revision to use
-s Add a sleep of 60 secs after svn up to be sure we have rsync:ed the puppet code to external puppet
-p Parse the manifests changed in the given revision(s)
-c compile all hosts in \$TARGET_ENV
-d Do a puppet dry-run (noop) on \$TARGET_HOSTS

Updates puppet modules from svn in \$TARGET_ENV on puppet master at the
beginning of run, and reverts if any failures.

The puppet master is used for parsing and compiling.

This script relies on environment variables:
* \$TARGET_ENV for svn
* \$TARGET_HOSTS for dry-run
";
}

if [ $# -lt 1 ]; then
usage; exit 1;
fi

# Set options
sleep=false; parse=false; compile=false; dryrun=false;
while getopts "r:spcd" option; do
case $option in
r) REVISION="$OPTARG";;
s) sleep=true;;
p) parse=true;;
c) compile=true;;
d) dryrun=true;;
*) echo "Unknown parameter: $opt $OPTARG"; usage; exit 1;;
esac
done
shift $((OPTIND - 1))

if [ "x$REVISION" = "x" ]; then
usage; exit 1;
fi

# This directory is updated by a Jenkins job
cd /opt/tools/ci-jenkins/jenkins-home/common-tools/scripts/ansible/

# SVN UPDATE ##################################################################
declare -i OLD_SVN_REV
declare -i NEXT_SVN_REV
## Store old svn rev before updating so we can roll back if not OK
OLD_SVN_REV=`ssh -T admin@puppetmaster svn info /etc/puppet/environments/${TARGET_ENV}/modules/| grep -E '^Revision:' | cut -d ' ' -f 2`
echo $'\n######### ######### ######### ######### ######### ######### ######### #########'
echo "Current svn revision in ${TARGET_ENV}: $OLD_SVN_REV"
if [ "$OLD_SVN_REV" != "$REVISION" ]; then
# We could have more than one commit since last run (even if we use post-commit hooks)
NEXT_SVN_REV=${OLD_SVN_REV}+1
# Update Puppet master
ansible-playbook puppet-master-update.yml -i hosts --extra-vars="target_env=${TARGET_ENV} revision=${REVISION}"
# SLEEP #############################
if $sleep; then
echo 'Sleep for a minute to be sure we have rsync:ed the puppet code to external puppet...'
sleep 60
fi
else
echo 'Svn was already at required revision. Continuing...'
NEXT_SVN_REV=$REVISION
fi

# Final result ################################################################
declare -i RESULT
RESULT=0
set +e  # Don't exit on error. Collect the errors instead.

# PARSE #######################################################################
if $parse; then
# Parse manifests ###################
## Get only the paths to the manifests that was changed (to limit the number of parses).
MANIFEST_PATH_LIST=`svn -q -v --no-auth-cache --username $JENKINS_USR --password $JENKINS_PWD -r $NEXT_SVN_REV:$REVISION \
  log http://scm.company.com/svn/puppet/trunk \
  | grep -F '/puppet/trunk/modules' | grep -F '.pp' |  grep -Fv '   D' | cut -c 28- | sed 's/ .*//g'`
echo $'\n######### ######### ######### ######### ######### ######### ######### #########'
echo $'Manifests to parse:'; echo "$MANIFEST_PATH_LIST"; echo "";
for MANIFEST_PATH in $MANIFEST_PATH_LIST; do
# Parse this manifest on puppet master
ansible-playbook puppet-parser-validate.yml -i hosts --extra-vars="manifest_path=/etc/puppet/environments/${TARGET_ENV}/modules/${MANIFEST_PATH}"
RESULT+=$?
done

# Check template syntax #############
TEMPLATE_PATH_LIST=`svn -q -v --no-auth-cache --username $JENKINS_USR --password $JENKINS_PWD -r $NEXT_SVN_REV:$REVISION \
  log http://scm.company.com/svn/platform/puppet/trunk \
  | grep -F '/puppet/trunk/modules' | grep -F '.erb' |  grep -Fv '   D' | cut -c 28-`
echo $'\n######### ######### ######### ######### ######### ######### ######### #########'
echo $'Templates to check syntax:'; echo "$TEMPLATE_PATH_LIST"; echo "";
for TEMPLATE_PATH in $TEMPLATE_PATH_LIST; do
erb -P -x -T '-' modules/${TEMPLATE_PATH} | ruby -c
RESULT+=$?
done
fi

# COMPILE #####################################################################
if $compile; then
echo $'\n######### ######### ######### ######### ######### ######### ######### #########'
echo "Compile all manifests in $TARGET_ENV"
ansible-playbook puppet-master-compile-all.yml -i hosts --extra-vars="target_env=${TARGET_ENV} puppet_args=--color=false"
RESULT+=$?
fi

# DRY-RUN #####################################################################
if $dryrun; then
echo $'\n######### ######### ######### ######### ######### ######### ######### #########'
echo "Run puppet in dry-run (noop) mode on $TARGET_HOSTS"
ansible-playbook puppet-run.yml -i hosts --extra-vars="hosts=${TARGET_HOSTS} puppet_args='--noop --color=false'"
RESULT+=$?
fi

set -e  # Back to default: Exit on error

# Revert svn on puppet master if there was a problem ##########################
if [ $RESULT -ne 0 ]; then
echo $'\n######### ######### ######### ######### ######### ######### ######### #########'
echo $'Revert svn on puppet master due to errors above\n'
ansible-playbook puppet-master-revert-modules.yml -i hosts --extra-vars="target_env=${TARGET_ENV} revision=${OLD_SVN_REV}"
fi

exit $RESULT

Ansible playbooks

The ansible playbooks called from bash are simple.

puppet-master-compile-all.yml

---
# usage: ansible-playbook puppet-master-compile-all.yml -i hosts --extra-vars="target_env=ci1 puppet_args='--color=html'"

- name: Compile puppet catalogue for all hosts for a given environment on the puppet master
  hosts: puppetmaster-int
  user: ciadmin
  sudo: yes      # We need to be root
  tasks:
    - name: Compile puppet catalogue for all hosts in {{ target_env }}
      command: puppet master {{ puppet_args }} --compile {{ item }} --environment {{ target_env }}
      with_items: groups['ci1']

puppet-run.yml

---
# usage: ansible-playbook puppet-run.yml -i hosts --forks=12 --extra-vars="hosts=xyz-ci puppet_args='--color=false'"

- name: Run puppet agents for {{ hosts }}
  hosts: $hosts
  user: cipuppet
  tasks:
    - name: Trigger puppet agent run with args {{ puppet_args }}
      shell: sudo /usr/bin/puppet agent {{ puppet_args }} --test || if [ $? -eq 2 ]; then echo 'Notice - There were changes'; exit 0; else exit $?; fi;
      register: puppet_agent_result
      changed_when: "'Notice - There were changes' in puppet_agent_result.stdout"

Ansible inventory file (hosts)

The hosts file is what triggers the ansible magic. Here’s an excerpt.

# BUILD SERVERS ###############################################################
[puppetmaster-int]
puppet.company.com

[puppetmaster-ext]
extpuppet.company.com

[puppetmasters:children]
puppetmaster-int
puppetmaster-ext

[puppetmasters:vars]
puppet_args=""

# System XYZ #######################################################################
[xyz-ci]
xyzint6.company.com
xyzext6.company.com

# PROD
[xyz-prod-ext]
xyzext1.company.com

[xyz-prod-ext:vars]
puppet_server=extpuppet.company.com

[xyz-prod-int]
xyzint1.company.com

[xyz-prod:children]
xyz-prod-ext
xyz-prod-int

...

# ENVIRONMENT AGGREGATION #####################################################
[ci:children]
xyz-ci
pqr-ci

[prod:children]
xyz-prod
pqr-prod

[all_envs:children]
dev
ci
st
qa
prod

# Global defaults
[all_envs:vars]
puppet_args=""
puppet_server=puppet.company.com

Marcus Philip
@marcus_phi

Test data – part 1

When you run an integration or system test, i.e. a test that spans one or more logical or physical boundaries in the system, you normally need some test data, as most non-trivial operations depend on some persistent state in the system. Even if the test tries to follow the advice of favoring verification of behavior over state, you may still need specific input to even achieve a certain behavior. For example, if you want to test an order flow for a specific type of product, you must know how to add a product of that type to the basket, e.g. knowing a product name.

But, and here is the problem, if you don’t have strict control of that data it may change over time, so suddenly your test will fail.

When unit testing, you’ll want to use mocks or fakes for dependencies (and have well factored code that lets you easily do that), but here I’m talking about tests where you specifically want to use the real dependency.

Basically, there are only two robust ways to manage test data:

  1. Each test creates the data it needs.
  2. Create a managed set of data that covers all of your test needs.

You can also use a combination of the two.

For the first strategy, either you take an idempotent approach so that you just ensure a certain state, or you create and delete the data for each run. In some cases you can use transactions to be able to safely parallelize your tests and not modify persistent state: just open one at the start of the test and then abort it instead of committing at the end. Obviously you cannot test functionality that depends on transactions this way.

The second strategy is a lot easier if you already have a clear separation between reference data, application data and transactional data.

By reference data I mean data that changes with very low frequency, that is often of limited size, and that has a list or key/value structure. Examples could be a list of supported languages or zip code to address lookup. This should be fairly easy to keep in one authoritative, version controlled location, either in bulk or as deltas.

The term application data is not as established as reference data. It is data that affects the behavior of the application. It is not modified by normal end user actions, but is continuously modified by developers or administrators. Examples could be articles in a CMS or sellable products in an eCommerce website. This data is crucial for tests. It’s typically the data that tests use as input or for assertions.

The challenge here is to keep the production data and the test data set in sync. Ideally there should be a process that makes it impossible (or at least hard) to update the former without updating the latter. However, there are often many complicating factors: the data can be in another system owned by another team and without a good test double, the data can be large, or it can have complex relationships or dependencies that sometimes very few fully grasp. Often it is managed by non-technical people, so their tool set, knowledge and skills are different.

Unit or component tests can often overcome these challenges by using a strategy to mock systems or create arbitrary test data and verify behavior and not exact state, but acceptance tests cannot do that. We sometimes need to verify that a specific product can be ordered, not a fictional one created by the test.

Finally, transactional data is data continuously created by the application. It is typically large, fast growing and of medium complexity. Examples could be orders, article comments and logs.

One challenge here is how to handle old, ‘obsolete’ data. You may have data stored that is impossible to generate in the current application because the business rules (and the corresponding implementation) have changed. For the test data it means you cannot use the application to create the test data, if that was your strategy. Obviously, this can make the application code more complicated, and for the test code, hopefully you have it organized so it’s easy to correlate the acceptance tests to the changed business rule and easy to change them accordingly. The tests may get more complicated because there can now be, for example, different behavior for customers with an ‘old’ contract. This may be hard for new developers in the team who only know the current behavior of the app. You may even have seemingly contradicting assertions.

Another problem can be the sheer size. This can be remedied by having a strategy for aggregating, compacting and/or extracting data. This is normally easy if you plan for it up front, but can be hard when your database is 100 TB. I know that hardware is cheap, but having a 100 TB DB is inconvenient.

The line between application data and transactional data is not always clear cut. For example when an end user performs an action, such as a purchase, he may become eligible for certain functionality or products, thus having altered the behavior of the application. It’s still a good approach though to keep the order rows and the customer status separated.

I hope to soon write more on the tougher problems in automated testing and of managing test data specifically.

Marcus Philip
@marcus_phi

Distributed version control systems

Distributed version control systems (DVCS) have been around for many years already, and are increasing in popularity all the time. There are however many projects that are still using a traditional version control system (VCS), such as Subversion. Until recently, I had only been working with Subversion as a VCS. Subversion sure has its flaws and problems but mostly got the job done over the years I’ve been working with it. I started contributing to the JBoss ShrinkWrap project early this spring, where they use a DVCS in the form of Git. The more I’ve been working with Git, the more aware I have become of the problems imposed by Subversion. The biggest transition for me has been to adopt the new mindset that DVCS brings. Suddenly I realized that my daily work has many, many times been shaped by the way the VCS worked, rather than by doing things the way that feels natural to me as a developer. I think this is one of the key benefits of DVCS, and you start to become aware of it as soon as you start using one.

While a traditional VCS can be sufficient in many projects, DVCSs bring new interesting dimensions and possibilities to version control.

What is a distributed version control system?
The fundamental principle of a DVCS is that each user keeps a complete, self-contained repository on his/her computer. There is no need for a central master repository, even if most projects have one, e.g. to allow continuous integration. This allows for the following characteristics:

* Rapid startup. Install the DVCS of choice and start committing instantly into your local repository.
* As there is no need for a central repository, you can pull individual updates from other users. They do not have to be checked in to a central repository (even if you use one) like in Subversion.
* A local repository allows you the flexibility to try out new things without the need to send them to a central repository and make them available to others just to get them under version control. E.g. it is not necessary to create a branch on a central server for this kind of operation.
* You can select which updates you wish to apply to your repository.
* Commits can be cherry-picked, which means that you can select individual patches/fixes from users as you like.
* Your repository is available offline, so you can check in, view project history etc. regardless of your Internet connection status.
* A local repository allows you to check in often, even though your code might not even compile, to create checkpoints of your current work. All this without interfering with other people’s work.
* You can change history, modify, reorder and squash commits locally as you like before other users get access to your work. This is called rebasing.
* DVCSs are far more fault-tolerant as there are many copies of the actual repository available. If a central/master repository is used it should be backed up though.

One of the biggest differences between Git and Subversion that I’ve noticed is not listed above: the speed of the version control system. The speed of Git has really blown me away; in terms of speed, it feels like comparing a Bugatti Veyron (Git) with an old Beetle (Subversion). A project which would take minutes to download from a central Subversion repository literally takes seconds with Git. Once, I actually had to verify that my file system actually contained all the files Git told me it had downloaded, as it went so incredibly fast! I want to emphasize that Git is not only faster when downloading/checking out source code the first time, it also applies to committing, retrieving history etc.

Squashing commits with Git
To be able to change history is something I’ve longed for during all these years working with Subversion. With a DVCS, it is possible! When I’ve been working on a new feature, previously I’ve usually wanted to commit my progress (as checkpoints, mentioned above), but in a Subversion environment this would screw things up for other team members. When I work with Git, it allows me the freedom to do what I’ve wanted to do during all these years: committing small incremental changes to the code base, without disturbing other team members in their work. For example, I could add a new method to an interface, commit it, start working on the implementation, commit often, work some more on the implementation, commit some more stuff, then realize that I need to rethink some of the implementation, revert a couple of commits, redo the implementation, commit etc. All this without disturbing my colleagues working on the same code base. When I feel like sharing my work, I don’t necessarily want to bring in all the small commits I’ve made at development time, e.g. a commit just adding javadoc to a method. With Git I can do something called squash, which means that I can bunch commits together, e.g. bunch my latest 5 commits into a single one, which I then can share with other users. I can even modify the commit message, which I think is a very neat feature.

Example: Squash the latest 5 commits on the current working tree
$ git rebase -i HEAD~5

This will launch a vi editor (here I assume you are familiar with it). Leave the first commit as pick and change the rest of the lines from pick to “squash”, i.e. change:

pick 79f4edb Work done on new feature
pick 032aab2 Refactored
pick 7508090 More work on implementation
pick 368b3c0 Began stubbing out interface implementation
pick c528b95 Added new interface method

to:

pick 79f4edb Work done on new feature
squash 032aab2 Refactored
squash 7508090 More work on implementation
squash 368b3c0 Began stubbing out interface implementation
squash c528b95 Added new interface method

On the next screen, delete or comment out all the lines you don’t want and add a more suitable commit message:

# This is a combination of 5 commits.
# The first commit's message is:
Added new interface method
# This is the 2nd commit message:
Began stubbing out interface implementation
...

to:

# This is a combination of 5 commits.
# The first commit's message is:
Finished work on new feature
#Added new interface method
# This is the 2nd commit message:
#Began stubbing out interface implementation
...

Save to execute the squash. This will leave you with a single commit with the message you provided. Now you can just share this single commit with other users, e.g. via push to the master repository (if used).
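
Sharing the squashed commit is then just an ordinary push (assuming the shared repository is set up as the remote origin and you are working on the master branch):

$ git push origin master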

Another interesting aspect of DVCSs is that if you use a master repository, it won’t get hit as often, since you execute your commits locally and squash things together before sending them upstream. This makes DVCSs more attractive from a scalability point of view.

Summary
A DVCS does not force you to have a central repository, and every user has their own local repository with full history. Users can work and commit locally before sharing code with other users. If you haven’t tried out a DVCS yet, do it! It is actually as easy as stated earlier: download, install and create your first repository! The concepts of DVCS may be confusing for a non-DVCS user at first, but there are a lot of tutorials out there and “cheat sheets” which cover the most basic (and more advanced) tasks. You will soon discover many nice features with the DVCS of your choice, making it harder and harder to go back to a traditional VCS. If you have experience from DVCSs, please share your experiences!

Tommy Tynjä
@tommysdk