Experiences in Erlang

At Heroku, Erlang powers a number of services; probably the highest-performance one is the Heroku router, which processes millions of requests a day, routing each one from the customer’s browser to the dyno where the app’s code is running.

We’re building a new service in Erlang. It’s currently sitting at around 2,500 lines of code (including tests), and is designed to handle several thousand requests per second.

It’s been a bit tricky (as it always is) getting to grips with how applications are designed and built in a new language, but out of the early primordial soup a clear shape has emerged, and I’m fairly happy with how it’s looking now.

As part of the productisation process for getting the application deployed, I was adding support for logging some stats from Redis.

We need to monitor the behaviour of our datastore (Redis) while it’s running, to make sure we’re not utilising more memory than we expect.

I looked at the INFO command in Redis; it returns a response structured like this:

# Section 1\r\n
key:value\r\n
key:value\r\n
\r\n
# Section 2\r\n
key:value\r\n

It’s clear that we need to maintain the name of the current section, the list of sections we’ve already seen, and the key/value items in the current section.

This can easily be stored as an Erlang tuple of the form {CurrentName, PreviousSections, CurrentSection}, with PreviousSections being a list of name and key/value pairs, and CurrentSection being the list of key/value pairs we’ve seen so far in the current section.

So, I looked at the lines, and tried to imagine what we should expect the state to be after processing each line of the Redis response.

Rough sketch

# Server\r\n                     {<<"Server">>, [], []}
redis_server:3.1.999\r\n         {<<"Server">>, [], [{redis_server, 3.1.999}]}
redis_git_sha1:084a59c3\r\n      {<<"Server">>, [], [{redis_git_sha1, 084a59c3}, {redis_server, 3.1.999}]}
\r\n                             {undefined, [{<<"Server">>, [{redis_git_sha1, 084a59c3}, {redis_server, 3.1.999}]}], []}
# Clients\r\n                    {<<"Clients">>, [{<<"Server">>, [...]}], []}

This allowed me to write tests to get the desired behaviour right…

    ?assertEqual({<<"Server">>, [], []},
                 redo_stats:parse_line(<<"# Server">>, {undefined, [], []})),

This test asserts that if the parse_line function receives the start of a section, and an empty current state, it returns a tuple with the CurrentName as <<"Server">> and the other values are empty.

To get this simple test to pass, we can use Erlang’s pattern matching mechanism:

parse_line(<<"# ", Section/binary>>, {undefined, AllSections, CurrentSection}) ->
    {Section, AllSections, CurrentSection};

Note that this function takes the current line as the first argument, and the state tuple as the second. Significantly, it only matches undefined as the CurrentName; this means that if we get a new section without having cleared the current one, the match will fail and the process will crash.

For the second line of the output, we can write a simple test…

    ?assertEqual({undefined, [], [{<<"version">>, <<"2.8.19">>}]},
                 redo_stats:parse_line(<<"version:2.8.19">>, {undefined, [], []})),

i.e. if we get a key:value line, we should add it to the CurrentSection list. Note that we can use the value undefined as the CurrentName, as this line shouldn’t change it.

To get this test to pass, we can again use pattern matching.

parse_line(Line, {Section, AllSections, CurrentSection}) ->
    [Key, Value] = binary:split(Line, <<":">>),
    {Section, AllSections, [{Key, Value}] ++ CurrentSection}.

Note this returns a copy of the CurrentSection list with the new {Key, Value} pair prepended to it.

For the third line, we expect the new value to be prepended to the list of key/value pairs we’ve seen in the current section. It’s simple enough to write a test for this:

    ?assertEqual({undefined, [], [{<<"sha1">>, <<"abcdef">>}, {<<"version">>, <<"2.8.19">>}]},
                 redo_stats:parse_line(<<"sha1:abcdef">>, {undefined, [], [{<<"version">>, <<"2.8.19">>}]})),

So, we now have two key/value pairs in our CurrentSection, the sha1 and the version keys.

Finally, when we get an empty line, we expect the CurrentName to be reset, and the section we’ve just finished (its name along with the key/value pairs we’ve seen) to be prepended to our PreviousSections list.

    ?assertEqual({undefined, [{<<"Server">>, [{<<"sha1">>, <<"abcdef">>}]}], []},
                 redo_stats:parse_line(<<>>, {<<"Server">>, [], [{<<"sha1">>, <<"abcdef">>}]})).

Note that the CurrentName is undefined and our CurrentSection is now empty, and our PreviousSections element contains a tuple of the form {SectionName, KeyValues}.

This is fairly trivial to get working too.

parse_line(<<>>, {Section, AllSections, CurrentSection}) ->
    {undefined, [{Section, CurrentSection}] ++ AllSections, []};

The <<>> element matches an empty line, and again we prepend the section to the list.

So, we’ve tested that we can parse a line, and that it returns a new state given some conditions. How about parsing the full output of the Redis command?

Again, we start with the test:

parse_response_test() ->
    Response = <<"# Server\r\nversion:2.8.19\r\nsha1:00000000\r\n">>,
    ?assertEqual([{<<"Server">>, [{<<"sha1">>, <<"00000000">>}, {<<"version">>, <<"2.8.19">>}]}],
                 redo_stats:parse_response(Response)).

To implement this is fairly trivial.

parse_response(Response) ->
    Lines = re:split(Response, <<"\r\n">>),
    {undefined, Sections, []} = lists:foldl(fun parse_line/2, {undefined, [], []}, Lines),
    Sections.

We split the response on the \r\n markers, and run lists:foldl over the lines with our parse_line/2 function and an initial state. The lists:foldl function is documented as:

Calls Fun(Elem, AccIn) on successive elements A of List, starting with AccIn == Acc0. Fun/2 must return a new accumulator which is passed to the next call. The function returns the final value of the accumulator. Acc0 is returned if the list is empty.
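For readers less familiar with Erlang, the same fold can be sketched in Python — a hypothetical translation, not the actual module. functools.reduce plays the role of lists:foldl (note that reduce passes the accumulator first, where foldl passes the element first):

```python
from functools import reduce


def parse_line(line, state):
    """Fold one line of INFO output into the (name, done, current) state."""
    name, done, current = state
    if line == b"":
        # A blank line closes the current section.
        return (None, [(name, current)] + done, [])
    if line.startswith(b"# "):
        # A section header; mirrors the Erlang clause that only matches
        # an undefined CurrentName, crashing otherwise.
        if name is not None:
            raise ValueError("new section before previous one was closed")
        return (line[2:], done, [])
    # An ordinary key:value line, prepended to the current section.
    key, value = line.split(b":", 1)
    return (name, done, [(key, value)] + current)


def parse_response(response):
    final = reduce(lambda acc, line: parse_line(line, acc),
                   response.split(b"\r\n"), (None, [], []))
    name, sections, current = final
    # Equivalent of matching {undefined, Sections, []} on the result.
    if name is not None or current:
        raise ValueError("unparsed trailing data")
    return sections
```

The shape of the result matches the Erlang tests: `parse_response(b"# Server\r\nversion:2.8.19\r\nsha1:00000000\r\n")` yields a single `(b"Server", [...])` section with the key/value pairs in reverse order of appearance.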

Again, we only match a result with undefined as the CurrentName and an empty CurrentSection, which means that if we get something back from Redis that we can’t parse, the match will fail, and the process will crash.

$ rebar shell
==> redo (shell)
Erlang/OTP 17 [erts-6.4] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

Eshell V6.4  (abort with ^G)
1> redo:start_link(testing).
2> Result = redo_stats:get_stats(testing).

The return value is an Erlang structure known as a “proplist”; we can fetch sections from it with…

3> Server = proplists:get_value(<<"Server">>, Result).

And even specific keys from a section:

5> proplists:get_value(<<"os">>, Server).
<<"Darwin 13.4.0 x86_64">>

All-in, the parsing of the response from Redis takes only 14 lines:

parse_response(Response) ->
    Lines = re:split(Response, <<"\r\n">>),
    {undefined, Sections, []} = lists:foldl(fun parse_line/2, {undefined, [], []}, Lines),
    Sections.

parse_line(<<>>, {Section, AllSections, CurrentSection}) ->
    {undefined, [{Section, CurrentSection}] ++ AllSections, []};
parse_line(<<"# ", Section/binary>>, {undefined, AllSections, CurrentSection}) ->
    {Section, AllSections, CurrentSection};
parse_line(Line, {Section, AllSections, CurrentSection}) ->
    [Key, Value] = binary:split(Line, <<":">>),
    {Section, AllSections, [{Key, Value}] ++ CurrentSection}.

SpanConf.io London 2014

Last Tuesday (29th), I attended Span Conf, “a practical developer conference about building scalable systems.” It was the first one, and was mostly organised by Couchbase’s Matt Revell.

It was an interesting day: attendance was around 130, with a single track. I attended most of the talks before having to head off to fly home; fortunately the final two talks were recorded, so I’ll hopefully get a chance to see them soon.

First up was Richard Conway (@azurecoder). This started off with a discussion of the work Microsoft has put into getting Hadoop to run on Windows (specifically in Azure), but the last quarter covered Spark. The syntax for transformations is a considerable improvement over Hadoop, but he did say that performance still lags a bit behind the more highly tuned Hadoop; they’re working on it tho’ :-)

The quick-fire approach continued with Javier Ramirez (@supercoco9) on “API Analytics with Redis and BigQuery”. The Redis part fell on the floor somewhere, so this was mostly about being able to parse 1TB of data a second, and how Google’s BigQuery works; it was probably more interesting to Big Data folks than to me.

Next up was Alex Dean (@alexcrdean) with “Why Your Company Needs a Unified Log”. For me at least, this heated things up a bit. Alex founded Snowplow, the folks behind “Snowplow Analytics”, which has https://github.com/snowplow/snowplow as its main project/product.

He covered the history of Business Analytics, from traditional Data-Warehousing via nightly ETL, right up to today’s must-have-more-data-right-now!

Snowplow is an event tracking system that logs events to various locations. One of the most interesting things for me was http://snowplowanalytics.com/blog/2014/03/11/building-an-event-grammar-understanding-context/ and https://github.com/snowplow/snowplow/wiki/snowplow-tracker-protocol, with “standards” for defining fields in logged events. Kafka was discussed at length as a mechanism for recording these historical logs, and the ability to play back segments of logs for event modelling was highlighted; they’re able to process from AWS Kinesis too. Probably worth watching the video.

I took advantage of the conference discount and bought Alex’s book, and I’ve enjoyed what’s there so far…

We continued with Richard Astbury (@richorama), who presented Microsoft’s “Orleans” project, which implements “Silos” and “Grains”: a Grain is created per entity to be tracked, and runs in a clustered container (a Silo). During the talk I kept thinking about how similar this sounded to Stateless Session EJBs and Erlang’s processes; after the talk someone referred to it as Erlang.NET, and indeed Robert Virding asked why they didn’t just use Erlang. It was actually an interesting talk for anybody wanting to run a lightweight process per entity (psmgr?), with a defined lightweight process lifecycle, including things like “suspend”.

The example cited was the online support for Halo 4, where there was a Grain per player recording the player’s stats; unlike previous versions, it didn’t fall over on launch day.

Next up was Stephane Maldini (@smaldini) on “Reactive Micro-services With Reactor”. This was a bit rambling, but covered Reactor, a “foundation for asynchronous applications on the JVM” and an alternative to the Microsoft-originated “Reactive Extensions” style of reactive event processing. It was interesting, tho’ it perhaps could’ve been a bit more focussed on practical uses. He touched on reactively generating “back pressure”, a topic we’d return to later.

We only had a couple of lightning talks after lunch. First up was a chappie from IBM talking about Bluemix; the eye-catching demo combining Node-RED (https://github.com/node-red/node-red) with cloud translation and cloud “question answering” services was impressive. Whether the people who would be most comfortable with visual programming techniques like that would actually use them is debatable, but it looked good tho’ :-)

The second lightning talk was from Steve Alexander (@now_talking), about how organisations should be divided along Dunbar’s-number lines (http://en.wikipedia.org/wiki/Dunbar’s_number). This was a brief talk, delivered in Steve’s own style, and an interesting wee diversion; for a better summary see this tweet.

Next up was Phil Wills’ (@philwills) talk, “Scaling TheGuardian.com with Scala, Elasticsearch and more”. I’d previously watched Graham Tackley’s talk on this, so Phil’s talk wasn’t quite as new to me. The main lessons were “separate your services” and “don’t allow a slow service to bring down the rest of your site”; this was more a look at the evolution of the Guardian site over the past twelve years (while Phil’s been there).

Following on from that was Michael Nitschinger (@daschl) talking about “Building A Reactive Database Driver On The JVM”. At the start Matt Revell made a point of saying that it wasn’t a “vendor talk” (Michael works for Couchbase, who provided a lot of the organisational manpower), and to be honest that was a disservice to Michael’s talk. It was a really interesting look at how they rebuilt the Couchbase Java driver to support both synchronous and asynchronous styles (synchronous calls on top of the async layer), how they managed to build “back pressure” into the driver, and the underlying architecture of something most folk would naively think was “just a driver”. This was really quite interesting, and the slides are here.

Next up was “Staying Agile In The Face Of The Data Deluge” from Martin Kleppmann (@martinkl). Coincidentally, I’d purchased his upcoming book (still in its early stages) a couple of weeks ago, and his talk was fascinating. It was about servicing requests (not necessarily HTTP requests) using stream processing, primarily building on top of Kafka (for volume) and using the Apache Samza project (which he’s a committer on); again, familiar topics came up — “back pressure”, streaming transformations, event histories.

I can recommend watching the video of this when it’s available; he’s uploaded his slides already (the final slide, with references, is well worth a look).

The final talk I got to see before I left was Elliot Murphy (@sstatik) on “Safeguarding sensitive data in the cloud”. I’m pleased that Heroku covers most of his recommendations, but this was another really interesting talk. He compared data security to the food industry, where people who don’t follow basic hygiene rules are considered negligent, and where even doing the simple things (washing your hands etc.) saves a lot of people from illness; in data security, a lot of people don’t follow the “basics”. He covered “hand washing” in data-security terms, and gave some useful guidelines on changing culture around the security of data. Again, a really interesting talk, and the slides are here.

Unfortunately I had to cut and run before the last two talks, including Robert Virding talking about the history of Erlang, but I’ll watch out for the videos.

It was a nice opportunity to catch up with some old Canonical friends, and hopefully, they’ll find the energy to run it again next year.

Apparently I bought ticket #1 back in June :-)

Moved to Heroku

After almost seven and a half years, I changed jobs from Canonical to Heroku.

I’d been at Canonical for more than three and a half times longer than any previous job; it took a long time to get bored, and I’m not even sure I did, but I grew frustrated with too many things and was looking for something new.

Fortunately, an ex-colleague from Canonical popped up with a daring rescue, for which I’m very grateful; just over 3 months later, I find myself getting my mind blown by how Heroku does things… lots of new things, too many services to count, and a vibrancy that had long left Canonical.

Ruby, Erlang, Go, Splunk, instant deployments, fantastic support data-gathering tools; the list of new and cool things at Heroku could go on and on. For now I’m still a bit bewildered and trying to find my feet, but it’s exhilarating :-)

Plug in to Elixir

I was involved in building the capomastro project, which manages building multiple dependencies in Jenkins, and the subsequent archiving of those dependencies.

Part of it involves handling HTTP notifications from Jenkins when a build transitions through various states.

I’ve been looking at micro-service implementations of the various parts of the service.

I started off with a simple test to get started…pushing a simple JSON notification in the same format as the Jenkins notification plugin sends us.

defmodule ExMastro.JenkinsNotificationsTest do
  use ExUnit.Case, async: true
  use Plug.Test

  @opts ExMastro.JenkinsNotifications.init([])

  test "emits appropriate notification event" do
    body = %{name: "mytestjob", url: "job/mytestjob", build: %{number: 11, phase: "STARTED", url: "job/mytestjob/11/"}}
    conn = conn(:post, "/jenkins/notifications/?server=1", Jazz.encode!(body), headers: [{"content-type", "application/json"}])

    conn = ExMastro.JenkinsNotifications.call(conn, @opts)

    assert conn.status == 201
    assert conn.resp_body == "mytestjob"
  end
end

Line 8 creates a map in the same format as the Jenkins notifications plugin sends.

Line 9 creates a Plug.Conn record, which is what Plug supplies to “plugs” when they’re handling a request, and then manually calls the Plug (in this case, a “module” Plug) to handle the request.

Line 11 calls the Routing plug to handle the connection, allowing it to dispatch based on URL and HTTP method.

Running the tests fails…not unreasonably, as we have no router or dependencies setup.

defmodule ExMastro.Mixfile do
  use Mix.Project

  def project do
    [app: :exmastro,
     version: "0.0.1",
     elixir: "~> 0.15.1",
     deps: deps]
  end

  # Configuration for the OTP application
  # Type `mix help compile.app` for more information
  def application do
    [applications: [:logger, :cowboy, :plug, :jazz],
     mod: {ExMastro, []}]
  end

  def deps do
    [{:cowboy, "~> 1.0.0"},
     {:plug, "~> 0.5.3"},
     {:jazz, github: "meh/jazz"}]
  end
end

Note, this app uses the Plug module, and Cowboy to do the web-serving, and Jazz to encode/decode JSON.

defmodule ExMastro.JenkinsNotifications do
  import Plug.Conn
  use Plug.Router

  plug Plug.Parsers, parsers: [ExMastro.Plugs.Parsers.JSON]
  plug :match
  plug :dispatch

  def init(options) do
    options
  end

  post "/jenkins/notifications" do
    conn = Plug.Conn.fetch_params(conn)
    # We have the server in conn.params["server"]
    # and the rest of the build in conn.params[:json]
    send_resp(conn, 201, conn.params[:json]["name"])
  end

  match _ do
    send_resp(conn, 404, "oops")
  end
end

Line 5 is interesting here. Plugs are stackable, which in practice means they get called in order to handle HTTP requests; as long as each returns a Plug.Conn structure and doesn’t call send_resp, processing continues to the next plug.

Line 5 puts a parsing Plug at the top of the stack. Parsers are expected to check whether they can process the content of the request, and if so, to return a tuple of:

{:ok, a map of things to merge into the conn.params structure, and the Conn}

This code should look for errors in the body (from the parsing plug), and do something better…

defmodule ExMastro.Plugs.Parsers.JSON do
  @moduledoc false
  alias Plug.Conn

  def parse(%Conn{} = conn, "application", "json", _headers, opts) do
    case Conn.read_body(conn, opts) do
      {:ok, body, conn} ->
        {:ok, %{json: Jazz.decode!(body)}, conn}
      {:more, _data, conn} ->
        {:error, :too_large, conn}
    end
  end

  def parse(conn, _type, _subtype, _headers, _opts) do
    {:next, conn}
  end
end

Line 5 uses pattern matching to pick up “application/json” content; we then attempt to read the body from the Plug.Conn record, handling errors, or parsing it as JSON. Note the ! in Jazz.decode!, which raises if it can’t parse the JSON.

This would cause the service to crash and burn, but we can set up a supervision tree to restart it.

Django on Dokku

I’ve been experimenting a bit with Dokku recently, with a view to automating deployment of some small services.

Getting a Django app up and running using Dokku is fairly easy…

I’m using dokku-alt here, as it preconfigures some useful plugins and has some feature improvements over the original.

1) Create virtual machine.

Start a virtual machine wherever you can (I used a .5GiB host in Digital Ocean for this.)

2) Configure DNS entries

Configure a wildcard DNS entry for the IP of your virtual machine. I run bind in my internal network at home, so for me this was as simple as creating a zone for it internally.

3) Install Dokku

Follow the dokku-alt instructions for installation.

$ curl -s https://raw.githubusercontent.com/dokku-alt/dokku-alt/master/bootstrap.sh | sudo sh

From now on, it mostly follows the “Getting started with Django on Heroku” article.

4) Prepare local virtualenv

Create a new virtualenv locally (I use virtualenvwrapper), to test your first app, and install the django-toolbelt package from the Cheeseshop.

$ mkvirtualenv djangokku && pip install django-toolbelt

5) Create a new Django app…

$ django-admin.py startproject djangokku

6) Setup the requirements

Change to the djangokku project directory and write a requirements.txt file.

$ cd djangokku
$ pip freeze > requirements.txt

7) Configure Django settings

Update djangokku/settings.py per the Heroku instructions.

Note, for Django 1.6 onwards, you don’t need to set the BASE_DIR as it’s already in the settings.py by default.

Also, I found the STATICFILES_DIRS settings were unnecessary.
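For reference, the relevant settings changes look roughly like this — a sketch based on the Heroku article, not my exact file; dj_database_url is provided by the django-toolbelt requirements:

```python
# djangokku/settings.py (excerpt) — a sketch per the Heroku article.
import dj_database_url

# Parse the DATABASE_URL environment variable into Django's DATABASES setting.
DATABASES = {"default": dj_database_url.config()}

# Honour the X-Forwarded-Proto header set by the routing layer.
SECURE_PROXY_SSL_HEADER = ("HTTP_X_FORWARDED_PROTO", "https")

# Allow all host headers (tighten this for real deployments).
ALLOWED_HOSTS = ["*"]

# Static asset configuration.
STATIC_ROOT = "staticfiles"
STATIC_URL = "/static/"
```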

8) Configure WSGI to serve static

This would be better done from Nginx.

Configure the WSGI file per the Heroku instructions

I left in the code that Django writes out to configure the environment with the correct settings, so my djangokku/wsgi.py file looks like this:

"""
WSGI config for djangokku project.

It exposes the WSGI callable as a module-level variable named ``application``.

For more information on this file, see
"""

import os
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "djangokku.settings")

from django.core.wsgi import get_wsgi_application
from dj_static import Cling

application = Cling(get_wsgi_application())

9) Add a Procfile

The Procfile “is a mechanism for declaring what commands are run by your application’s dynos on the [Dokku] platform.”

$ echo "web: gunicorn djangokku.wsgi" > Procfile

10) Setup for using git to deploy

Add at least this to your .gitignore file.


Then setup your Django project as a git repository.

$ git init .
$ git add .
$ git commit -am "Initial commit."

11) Configure git for deployments

Add a git remote to deploy directly to the Dokku instance, note that my internal DNS resolves www.dokku via the wildcard to my Digital Ocean instance.

$ git remote add dokku dokku@www.dokku:djangokku

12) Deploy!

$ git push dokku master

This should push your code to your server, and deploy it…

         Running setup.py install for pystache
           pystache: using: version '2.1' of 

           Installing pystache script to /app/.heroku/python/bin
           Installing pystache-test script to /app/.heroku/python/bin
         Running setup.py install for static

           Installing static script to /app/.heroku/python/bin
       Successfully installed Django dj-database-url dj-static django-toolbelt gunicorn psycopg2 pystache static
       Cleaning up...
-----> Preparing static assets
       Running collectstatic...
       69 static files copied to '/app/staticfiles'.

-----> Discovering process types
       Procfile declares types -> web
remote: Sending build context to Docker daemon 2.048 kB
remote: Sending build context to Docker daemon
Step 0 : FROM dokku/djangokku:build
 ---> d158b4a7afb4
Step 1 : ENV GIT_REV 981b43800b11a07eaadd7fbed628bffbc2dc9507
 ---> Running in b4fb0b26d961
 ---> 65ad2d564bb6
Removing intermediate container b4fb0b26d961
Successfully built 65ad2d564bb6
-----> Releasing djangokku ...
-----> Injecting Supervisor ...
-----> Deploying djangokku ...
=====> Application deployed:
-----> Cleaning up ...
To dokku@www.dokku:djangokku
 * [new branch]      master -> master

Visiting your newly deployed app should get the familiar “Congratulations on your first Django-powered page.”

13) Databases!

Django apps require Databases…in our case, we’ll connect djangokku to a PostgreSQL database server…

To do this, we need to ssh to the server that’s hosting dokku and create a database using dokku.

$ dokku postgresql:create testing

And link this database to our djangokku application.

$ dokku postgresql:link djangokku testing

This will redeploy the app, connected to the database:

-----> Releasing djangokku ...
-----> Injecting Supervisor ...
-----> Deploying djangokku ...
=====> Application deployed:

You can now run the database sync just as you would locally:

$ dokku run djangokku python ./manage.py syncdb

Creating tables ...
Creating table django_admin_log
Creating table auth_permission
Creating table auth_group_permissions
Creating table auth_group
Creating table auth_user_groups
Creating table auth_user_user_permissions
Creating table auth_user
Creating table django_content_type
Creating table django_session

You just installed Django's auth system, which means you don't have any superusers defined.
Would you like to create one now? (yes/no): yes
Username (leave blank to use 'root'): kevin
Email address: kevin@bigkevmcd.com
Password (again):
Superuser created successfully.
Installing custom SQL ...
Installing indexes ...
Installed 0 object(s) from 0 fixture(s)

That’s the basics of getting a Django app up and running in Dokku.

Branching Strategies

I’ve been using Bazaar for over 6 years. Distributed version control was, for me, a huge improvement over what we used previously (Subversion), which was itself a huge improvement over CVS.

Twice, we’ve thought about branching, and twice, we’ve ended up with the same basic strategy.

We want to have active development on trunk, a staging deployment, mainly to test code with a production-like setup (and data volumes), and a production release.

The production release has been scheduled as often as weekly (and sometimes more frequently than that), but releases have sometimes been a month or more apart (after we got out of initial development of features).

What we ended up with was three main branches, named for their deployment environments, with trunk, production and staging branches.

Show-stopper defects found in staging are fixed in staging, and the fix merged back to trunk, same for production.

Trunk is almost always deployable, but, we don’t currently use feature-flags to hide “part-baked” functionality in production, and we rarely use long-lived feature branches to develop features against.

All features/fixes are branched from trunk, and merged right back to trunk once they’ve been reviewed. This approach can sometimes lead to “super-branches”: huge branches which incorporate everything from the infrastructure to the UI, and which are problematic to review. I can definitely recommend keeping branches below 200 lines of diff for ease of reviewing.

Instead of feature-flags, we’ve tended to build from the bottom-up, so…put in the infrastructure to support the UI, and then add the UI at the end.

This hasn’t always worked well, mainly because it’s often only on seeing the UI that a user’s requirements solidify… and that can mean some rework. We’ve gotten round that by having pre-merge-to-trunk customer acceptance demos, but even this can be a bit tricky, as the user doesn’t get to use the software themselves; generally they’re getting a screen-shared demo.

We rarely have show-stopper-fix-in-production bugs; most issues are things that can make their way through the trunk-staging-production process.

Scrum, Scrumban and KanBan

We switched a couple of months ago from a failing Scrum-like process to a mostly-Kanban process.

Why did we switch?

We were attempting Scrum, but in reality our team is too small, and we have a lot of new tasks coming in — tasks that couldn’t really be put off, and which were pushing out our burndown a lot. Our sprints ended with a scramble to get features landed, and the day before the end invariably had several reviews in the queue.

When we were committing to a set of tasks, our customers weren’t, so every period ended with a rush to get stuff ready for the bi-weekly demo, while many of the “we also spent a fair bit of time doing this” items weren’t being recorded.

Another big failing was that we had our bi-weekly iteration planning meeting late on a Friday afternoon (UK time), so after a while, when questions needed to be asked, we didn’t ask them… because it made the end of our week that little bit closer.

Also, there was too much “throw it over the wall” from both the customer and the developers. The demo was failing (as Scrum demos are somewhat likely to) because developers don’t demo “bugs”; we failed to identify the stories of most value to the customer early; and the customer didn’t use the product until very late in the development cycle, so we didn’t get into a good iterative pattern. Personally, the goal (and the gratification) of developing software is people using it, and I believe we left that too late, for various reasons.

So, Scrum just wasn’t working very well.

What were we hoping to gain?

We wanted better recording of tasks. When you have a prescribed story list but lots of interrupt items, and the interrupt items aren’t being recorded very well (as we discovered ours weren’t), commitments to stories within a sprint become hard.

We wanted to shorten the release cadence. It’s of course possible to disconnect sprints and releases in a Scrum process, but when things are focussed on the end-of-sprint demo, the expectation is that we’re demoing the stuff that will be available in the release, so we naturally fell into bi-weekly releases.

Tightening the release period means shorter feedback loops, but this has to be understood as a way of changing things rapidly.

Kanban in practice

Since we moved to Kanban, we’ve mostly brought our releases into a regular weekly cadence, but as the classic “pigs and chickens” story for Scrum illustrates, without commitment from all parties this is hard. “Common purpose” is way underestimated in companies; different teams have different agendas, and even if a team needs a feature, that doesn’t mean they’ll have time to actually shepherd it through to completion.

We brought in fairly strict Work-in-Progress (WIP) limits for our lanes, and we’ve incrementally added lanes to reflect our process better. There’s resistance to adding lanes (which IMO is good), but we’ve increased them from three to eight since we started.

We’ve been using LeanKit Kanban, which has proved “ok” at best; the UI is not great, but functionally it works.

We don’t have WIP limits for some lanes, specifically our release lanes, but we’ve found that including an “External” lane has proved useful for keeping an eye on tasks that depend on someone outside our team. We check in with the tasks in that lane regularly to ensure they’re not being forgotten about, and we put in the “Blocked” field what it is we’re waiting on, to ensure it’s clearly defined (or at least as clearly defined as it can be).

We have better recording of our interrupt tasks; I’m now confident that we’re recording pretty much every task we spend time on. This proves useful for identifying repetitive tasks, as well as the more obvious benefit of allowing someone to pick up tasks when a colleague is off on holiday or sick.

We started off overfilling the “Not Started” lane, which made prioritisation harder; it’s demonstrably easier to prioritise a small number of items than a large list. We tended to have 15–20 items in “Not Started”, but to be putting new items at the top of the lane regularly, so the bottom items were never getting anywhere.

We’re now keeping only 6–10 items in the “Not Started” lane, clearly prioritised, with useful input from stakeholders on the prioritisation. This tightens the feedback loop on bugs in particular: if they’re at the top of the “Not Started” lane, they will be worked on very soon.

We’ve had problems with incomplete cards; these are problematic because they block the “In Progress” lane for an indeterminate time while we work out what to do with them. Now we’re fairly strict about ensuring that before a card can be moved out of the Icebox, it’s well enough understood that it can be worked on. This is an ideal opportunity for knowledge sharing, ensuring the card can be implemented by anybody in the team, as well as an opportunity to discuss the card before it’s taken up by a single developer. One of the big problems of remote working is that you can’t really have “over-the-desk” conversations, bouncing ideas around; the spontaneity is lost when you have to start a Google Hangout, so fixes tend to be made in isolation. Properly capturing the conversation around a card as it’s moved into “Not Started” is essential tho’.


The jury’s still out on Kanban. There’s a misconception that Kanban is a process, when it’s really a visualisation of your workflow. If your workflow is unclear, or faulty, Kanban won’t magically fix it, but it provides tools that make it easy to identify bottlenecks in your existing workflow; it doesn’t define a workflow itself.

The options for lanes are virtually infinite, and you have to create lanes that reflect your current workflow, identify the problems using gathered data (this is one area where LeanKit really shines), come up with solutions, and repeat the process…

The significance of the friction that WIP limits generate has to be properly understood: it’s there to force discussions, and it should be healthy friction. In a team with good “common purpose”, hitting the WIP limit for a lane is the siren that says “There’s a blockage here, what can I do to unblock it?”. If you’re unable to unblock it, then a wider conversation is needed; extending the WIP limit without that conversation would be a mistake, but an easy one to make.

One of the other tenets of Kanban is continuous improvement, and experimentation is encouraged. With such a loose mechanism, it’s easy to increase the WIP limit in a lane for a period and assess the implications for the other workflow stages.

We’ve recently tried limiting the number of cards in the “Not Started” lane. This has meant that we’ve run out of items a couple of times and had to get together to move some items over, but the upside is that we’ve had better prioritisation.

Progress in my projects…

I hibernated over winter, not doing much around the house, but as the weather’s picked up (and with my mum and dad’s imminent first visit), I’ve started to push my projects forward.

I bought a 5m reel of 5050 RGB LEDs, destined for under-cupboard lighting in the kitchen. To drive the strips, I’ve been experimenting with JeeNodes, specifically an LED Node v2a.

These are basically mini 3.3V Arduinos with built-in radio transceivers, which makes communication between them easy enough; you can send in-memory structures fairly trivially with the RF12 library.

And there’s a great ledNode sketch that drives the LED Node.

I’ve hooked it up to a PIR sensor, which controls the lighting via a ZMQ-connected service (the colour of the lighting is controlled separately).

I’ve got 5 JeeNodes running (and the kits for a couple more) and two Raspberry Pis providing the horsepower, along with some other bits and pieces.

I’ve also implemented calendar event broadcasting (from Google calendars). This was fun to do: it pulls calendar events for a list of calendars and emits the start/end of those events, and services can be written to listen for those events and respond.
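The core of the broadcaster boils down to something like this sketch (the event structure and names here are invented for illustration; the real service pulls from the Google Calendar API and emits over ZMQ):

```python
import datetime


def emit_transitions(events, now, emit):
    """Emit start/end notifications for any events that have just
    crossed their start or end time."""
    for event in events:
        if event["start"] <= now < event["end"] and not event.get("started"):
            event["started"] = True
            emit({"type": "start", "name": event["name"]})
        elif now >= event["end"] and not event.get("ended"):
            event["ended"] = True
            emit({"type": "end", "name": event["name"]})


emitted = []
events = [{"name": "standup",
           "start": datetime.datetime(2013, 4, 1, 9, 0),
           "end": datetime.datetime(2013, 4, 1, 9, 15)}]
# Poll twice: once mid-event, once after it has finished.
emit_transitions(events, datetime.datetime(2013, 4, 1, 9, 5), emitted.append)
emit_transitions(events, datetime.datetime(2013, 4, 1, 9, 20), emitted.append)
print(emitted)  # a "start" for standup, then an "end"
```

A listening service then only needs to subscribe to the start/end messages, not know anything about calendars.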

My current task is building a ZMQ service that talks to my node-rfxtrx library, which will then be connected to the front-end I’m building, and at that point, I’ll be happy for now :-)

Golang, concurrency and Python

I’ve been working a fair bit with Go recently. My employer has selected it as the language for developing services, and we’re looking at ways in which Go can solve our problems.

I gave a talk last month looking at the syntax of Go, compared to Python.

Go turns out to be fast, easily compiled, and mostly easily understood. The mantra “Don’t communicate by sharing memory; share memory by communicating.” is essential to understanding why Go is great to work with.

Goroutines and channels are great for synchronising processes… but wait: I’ve been a Python programmer for a long time, and it would be remiss of me to neglect Python’s built-in libraries.

We have the venerable threading library. Threading is not great: it does the job (mainly slowing your Python application down), but beyond quick demos you need to do a lot to make it safe to work with.
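As a minimal sketch of what “safe” entails (the counter and thread counts here are arbitrary): even incrementing a shared counter from several threads needs a Lock, because the read-modify-write isn’t atomic.

```python
import threading

counter = 0
counter_lock = threading.Lock()


def increment(times):
    global counter
    for _ in range(times):
        # Without the lock, the read-modify-write on counter can
        # interleave between threads and lose updates.
        with counter_lock:
            counter += 1


threads = [threading.Thread(target=increment, args=(100000,))
           for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(counter)  # 400000 with the lock; without it, results can vary
```

And that’s before you get to coordinating shutdown, or passing results between threads.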

Python 2.6 (released in October 2008) brought with it the multiprocessing library, which has Queues and Pipes. These are really not that far removed from Go’s channels, and with a tiny bit of sugar can prove quite useful. It gets round Python’s GIL by starting separate interpreter processes, so the workers don’t contend for a single GIL.

from multiprocessing import Process, Queue

def main():
    exit_queue = Queue()
    timer_queue = Queue()
    datapoints_queue = Queue()
    Process(
        target=send_timed_messages,
        args=(5, timer_queue, exit_queue)).start()
    Process(
        target=get_active_builders,  # worker name assumed; elided in the original
        args=(timer_queue, datapoints_queue, exit_queue)).start()
    Process(
        target=push_active_builders_to_cosm,
        args=(datapoints_queue, exit_queue)).start()

We create the Queues and hand them off to the subprocesses that do the work…

from queue import Empty  # "from Queue import Empty" on Python 2

def push_active_builders_to_cosm(datapoints_queue, exit_queue):
    """Wait for data from the datapoints_queue, and start a
    process to push it to Cosm.
    """
    exit = False
    while not exit:
        message = None
        try:
            message = datapoints_queue.get_nowait()
        except Empty:
            pass
        if message:
            _, time_value, value = message.split("::")
            Process(
                target=push_datapoint_to_cosm,  # target name assumed
                args=(time_value, value)).start()
        try:
            exit = exit_queue.get_nowait()
        except Empty:
            pass
To exit the process, we push True onto the exit_queue from our main process, and the worker exits cleanly; otherwise it reads from the datapoints queue and starts another process to push the data to Cosm.
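The Pipes mentioned earlier feel even closer to a Go channel: two connected endpoints, with send on one side and recv on the other. A minimal sketch (the doubling worker is invented for illustration) of handing a value to a subprocess and reading the reply:

```python
from multiprocessing import Pipe, Process


def worker(conn):
    # Read one value from our end of the pipe, double it, send it back.
    value = conn.recv()
    conn.send(value * 2)
    conn.close()


def demo():
    parent_conn, child_conn = Pipe()
    process = Process(target=worker, args=(child_conn,))
    process.start()
    parent_conn.send(21)
    result = parent_conn.recv()
    process.join()
    return result


if __name__ == "__main__":
    print(demo())  # 42
```

recv() blocks until the other end sends, which gives you the same rendezvous-style synchronisation as an unbuffered channel.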

ZeroMQ has a nice mechanism for polling sockets, the Poller object… we can implement a reasonably efficient version for Queues using the select module (we could make it more efficient with select.epoll):

from select import select

class QueueSelector(object):
    def __init__(self, queues):
        """Takes a list of multiprocessing.Queue objects; relies on each
        queue's internal _reader pipe end, which has a file descriptor
        that select() can wait on.
        """
        self._queues = dict(zip(
            [queue._reader for queue in queues], queues))

    def select(self):
        """Blocks, then returns a list of queues that have data waiting."""
        readers, _, _ = select(list(self._queues.keys()), [], [])
        return [self._queues[reader] for reader in readers]
def push_active_builders_to_cosm(datapoints_queue, exit_queue):
    """Wait for data from the datapoints_queue, and start a
    process to push it to Cosm.
    """
    exit = False
    selector = QueueSelector([datapoints_queue, exit_queue])
    while not exit:
        readers = selector.select()
        if datapoints_queue in readers:
            message = datapoints_queue.get()
            _, time_value, value = message.split("::")
            Process(
                target=push_datapoint_to_cosm,  # target name assumed
                args=(time_value, value)).start()
        if exit_queue in readers:
            exit = exit_queue.get()

This doesn’t handle write queues, where you’re waiting for a queue to become writable, but adding that should be fairly easy.
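For what it’s worth, a sketch of what the write side might look like, relying on the same CPython internals as QueueSelector (each Queue also has a _writer pipe end; _reader and _writer are implementation details rather than public API, and this is Unix-only):

```python
from multiprocessing import Queue
from select import select


class ReadWriteQueueSelector(object):
    def __init__(self, queues):
        """Index the queues by both their reader and writer pipe ends,
        whose file descriptors select() can wait on."""
        self._readers = dict((q._reader, q) for q in queues)
        self._writers = dict((q._writer, q) for q in queues)

    def select(self):
        """Blocks, then returns (readable, writable) lists of queues."""
        readers, writers, _ = select(
            list(self._readers), list(self._writers), [])
        return ([self._readers[r] for r in readers],
                [self._writers[w] for w in writers])


queue = Queue()
selector = ReadWriteQueueSelector([queue])
readable, writable = selector.select()
print(queue in writable)  # True: an empty queue is immediately writable
```

The writer end of a pipe shows up as writable until the pipe’s buffer fills, so this only starts blocking writers once a queue is genuinely backed up.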