How octohost uses Consul watches

I've been working on octohost lately, updating and upgrading some of the components. One of the things I've been looking for a chance to play with has been Consul watches - and I think I've found a great use for them.

As background, when you git push to an octohost, it builds a Docker container from the source code and the Dockerfile inside the repository. Once the container is built and ready to go, it does a few specific things:

  1. It grabs the configuration variables stored in Consul to start the proper number of containers.
  2. It registers the container as providing a Consul Service.
  3. It updates the nginx configuration files so that nginx can route traffic to the proper container.
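
In Consul HTTP API terms, steps 1 and 2 boil down to something like the following - the key path, service name and port are illustrative, not octohost's exact code:

# Step 1: read the container's configuration variables from the KV store
curl -s http://localhost:8500/v1/kv/octohost/html?recurse

# Step 2: register the running container as a Consul service
curl -s -X PUT -d '{"Name": "html", "Port": 49153}' \
  http://localhost:8500/v1/agent/service/register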

Last week I updated octohost to improve those steps for a few reasons:

  1. If we changed any of the configuration variables, we had to manually restart the container before the changes were picked up.
  2. If a container died unexpectedly, we weren't automatically updating the nginx configuration to reflect the actual state of the application.
  3. Our nginx configuration file was being built by a gem I created and wanted to retire in favor of Consul Template. The monolithic file it generated was very inflexible and I wanted to make it easier to update.

For #1, when a site is pushed to octohost, I'm registering a "watch" for a specific location in Consul's key/value store - octohost/$container-name. That kind of watch looks like this:

{
  "watches": [
    {
      "type": "keyprefix",
      "prefix": "octohost/html",
      "handler": "sudo /usr/bin/octo reload html"
    }
  ]
}

We're telling Consul to watch the octohost/html keys and anytime they change, to run sudo /usr/bin/octo reload html. As you can imagine, that reloads the container. Let's watch it in action:

Pretty nice eh? You can add keys or change values and the watch knows to run the handler to stop and start the container.
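
For example, adding or changing a key under that prefix via the HTTP API is enough to fire the handler - the key and value here are made up:

curl -X PUT -d '2' http://localhost:8500/v1/kv/octohost/html/CONTAINERS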

NOTE: Before version 0.5, deleting a key doesn't do what you'd expect, but the Consul team knows about this and has posted a fix.

NOTE: This has been disabled because of this issue: hashicorp/consul/issues/571

For #2 and #3, we look at the Consul Service catalog we are populating here and register a different type of watch - a service watch. An example service watch looks like this:

{
  "watches": [
    {
      "type": "service",
      "service": "html",
      "handler": "sudo consul-template -config /etc/nginx/templates/html.cfg -once"
    }
  ]
}

We're telling Consul to watch the html service and, whenever its status changes, to run the consul-template handler. That handler rebuilds the template we use to tell nginx where to route container traffic. Let's watch that handler in action:

All of that was done by the Consul watch - it fires whenever it detects a change in the service catalog - I didn't have to do anything. I even killed a container at random, and it removed it from the configuration file auto-magically.
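
For reference, here's roughly what the consul-template configuration and nginx template behind that handler could look like. Other than html.cfg, the file names, server_name and upstream layout are my own sketch, not octohost's exact files:

# /etc/nginx/templates/html.cfg
consul = "127.0.0.1:8500"

template {
  source      = "/etc/nginx/templates/html.ctmpl"
  destination = "/etc/nginx/sites-enabled/html.conf"
  command     = "service nginx reload"
}

# /etc/nginx/templates/html.ctmpl
upstream html {
  {{range service "html"}}server {{.Address}}:{{.Port}};
  {{end}}
}

server {
  listen 80;
  server_name html.example.com;

  location / {
    proxy_pass http://html;
  }
}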

Consul watches are pretty cool. If you're adding one to your Consul cluster, remember a few things:

  1. I've used separate JSON files for each watch. That works because we're telling Consul to look in an entire directory for configuration files via the -config-dir option (there's an example after this list).
  2. When you add a new file to the config-dir, you need to tell Consul to reload so it can read and activate it. If there's a syntax error or it can't load the watch, Consul notes that in its logs - so keep an eye on them while you're doing this.
  3. As of this moment, and because it's brand new software, Consul Template can only do a single pass to populate values - so the templates need to be pretty simple. We've worked around that limitation by doing our own first pass to pre-populate the values that are needed. Thanks to @bryanlarsen and @sethvargo, who discussed a workaround here: hashicorp/consul-template/issues/88
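
To make #1 and #2 concrete, the setup looks something like this - the paths are illustrative:

# one JSON file per watch, all in the agent's configuration directory
ls /etc/consul.d/
octohost-html-keys.json  octohost-html-service.json

# the agent was started pointing at that directory
consul agent -config-dir /etc/consul.d -data-dir /var/lib/consul

# after adding or changing a watch file, reload the agent and check its logs
consul reload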

I think I've just scratched the surface with how to use Consul watches effectively and they have helped to simplify octohost. I'm looking forward to finding new and better uses for them.

NOTE: A special shout out to Armon, Seth, Mitchell and the rest of the crew at Hashicorp for some great software that can be twisted to further my plans for world domination.

Monitoring Apache Processes with Datadog

At nonfiction, we hosted the sites we built with a number of different hosting providers. The vast majority of those sites are on Rackspace Cloud instances, which have been very reliable for our workloads.

One of those servers had been acting up recently, becoming unresponsive for no obvious reason, so we took a closer look one morning after being woken up at 5AM.

Watching top for a moment, we noticed that some Apache processes were getting very large. Some of them were using between 500MB and 1GB of RAM - well outside the normal usage pattern.

The first thing we did was set some reasonable limits on how large the Apache + mod_php processes could get - a PHP memory_limit of 256M. Since the Apache error logs are all aggregated with Papertrail, we set up an alert that sent a message to the nonfiction Slack room if any processes were killed. Those alerts look like this:

Once that was set up, we very quickly found that a customer on a legacy website had deleted some very important web pages - when those pages were missing, some very bad things could happen with the database. This had been mitigated in a subsequent software release, but that site hadn't been patched. The pages were restored and the site was patched. The problem was solved - at least the immediate problem.

Keeping an eye on top, we could see there were still websites using more memory than normal - at least more than we thought was normal. But sitting there watching wasn't a reasonable solution, so we whipped up a small script to send information to Datadog's dogstatsd agent that was already on the machine.
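
The script itself doesn't need to be much more complicated than this sketch, which assumes the dogstatsd-ruby gem and uses a metric name I made up for illustration:

#!/usr/bin/env ruby
# Send the resident memory of every Apache worker to the local dogstatsd agent.
require 'statsd' # provided by the dogstatsd-ruby gem

statsd = Statsd.new('127.0.0.1', 8125)

# `ps` prints one "pid rss-in-KB" pair per Apache process
`ps -C apache2 -o pid=,rss=`.each_line do |line|
  _pid, rss_kb = line.split
  # a histogram gives us avg/max/percentile series per flush interval
  statsd.histogram('apache.process.rss_kb', rss_kb.to_i)
end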

We were grabbing the memory size of the Apache processes and sending it to Datadog - the graphs generated from that data look like this:

Now we had a better - albeit fairly low resolution - window into how large the Apache processes were getting.

Over the last week, we've had enough data to make some changes to how Apache is configured and then measure how it responds. Here's what the entire week's memory usage looked like:

Using Datadog's built-in process monitoring function and this graph, we gained some insight into how things were behaving overall, but not enough detail about exactly which sites were the memory hogs.

In order to close that gap, I wrote another small Ruby script and between ps and /server-status we had all the information we needed:

We can now see which sites are using the most memory in the heatmap, and the nonfiction team will be able to take a look at those sites and adjust as necessary. It's not a perfect solution, but it's a great way to get more visibility into exactly what's happening - and it only took a couple of hours in total.

What did we learn from all of this?

  1. Keeping MinSpareServers and MaxSpareServers relatively low can help to kill idle servers and reclaim their memory. We settled on 4 and 8 in the end - that helps to keep overall memory usage much lower. (The exact configuration lines are in the snippet after this list.)

  2. A small change - a missing page in a corporate website - can have frustrating repercussions if you don't have visibility into exactly what's happening.

  3. The information you need to solve the problem is there - it just needs to be made visible and digestible. Throwing it into Datadog gave us the data we needed to surface targets for optimization and helped us to quickly stabilize the system.
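
For the record, the change from #1 amounts to a couple of lines in Apache's prefork MPM configuration - the IfModule wrapper is just the stock layout:

<IfModule mpm_prefork_module>
    MinSpareServers    4
    MaxSpareServers    8
</IfModule>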

All the source code for these graphs is available here and here. Give them a try if you need more insight into how your Apache is performing.

Full disclosure: I currently work as a Site Reliability Engineer on the TechOps team at Datadog - I was co-owner of nonfiction studios for 12 years.

Consul exec is a whole lot of fun.

I've been setting up a Consul cluster lately and am pretty excited about the possibilities with consul exec.

consul exec allows you to do things like this:

consul exec -node {node-name} chef-client

You can also target a service:

consul exec -service haproxy service haproxy restart
i-f6b46b1a:  * Restarting haproxy haproxy
i-f6b46b1a:    ...done.
i-f6b46b1a:
==> i-f6b46b1a: finished with exit code 0
i-24dae4c9:  * Restarting haproxy haproxy
i-24dae4c9:    ...done.
i-24dae4c9:
==> i-24dae4c9: finished with exit code 0
i-78f37694:  * Restarting haproxy haproxy
i-78f37694:    ...done.
i-78f37694:
==> i-78f37694: finished with exit code 0
3 / 3 node(s) completed / acknowledged

No SSH keys. No Capistrano timeouts. No static role and service mappings that may be out of date at that very second. No muss and no fuss.

Serf - one of the technologies that underlies Consul - used to have the concept of a 'role'. We've been able to approximate these roles with Consul tags to get a similar effect.

To do that, we've added a generic service to each node in the cluster and have tagged the node with its chef roles and some other metadata:

{
  "service": {
    "name": "demoservice",
    "tags": [
      "backend",
      "role-base",
      "haproxy",
      "monitoring-client",
      "az:us-east-1c"
    ],
    "check": {
      "interval": "60s",
      "script": "/bin/true"
    }
  }
}

Now each node in the cluster - even ones that wouldn't otherwise have a specific entry in the service catalog - can have commands run against it:

consul exec -service demoservice -tag az:us-east-1c {insert-command-here}

consul exec -service demoservice -tag haproxy {insert-command-here}

consul exec -service demoservice -tag backend {insert-command-here}

Each node runs a service check every 60 seconds - we chose something simple that will always report true.

consul exec is also really fast. Running w across a several-hundred-node cluster takes approximately 5 seconds with consul exec - running it with our legacy automation tool takes about 90 seconds in comparison.

I'm not sure if we're going to use it yet, but the possibilities with consul exec look pretty exciting to me.

NOTE: There are some legitimate security concerns with consul exec - right now it's pretty open - but they're looking at adding ACLs to it. It can also be completely disabled with disable_remote_exec - that may fit your risk profile until it's been tightened up a bit more.
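
Disabling it is a one-line agent configuration option - drop something like this into the agent's config directory:

{
  "disable_remote_exec": true
}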

Aloak is the worst domain registrar I have ever used.

TLDR: If you're having problems with a .ca domain name, reach out to CIRA - they may be able to help!

Late last year, I started to move 4 domain names off of Aloak - a registrar I had used for years. I was concerned for a few reasons:

  1. They weren't responsive to any request I had made in the last few years. I always had to ask and re-ask and continue to ask for small changes.

  2. Their web interface was abysmal and didn't work properly. I couldn't change items I needed to change.

  3. Their SSL certificate had actually expired in 2010.

After a couple of months, I had to abandon the effort - I stopped emailing them after nothing was done.

In May of 2014 I picked up the effort again, and in June it was finally done - we had enlisted DNSimple and their Concierge service to get it completed.

On July 9th I emailed a customer of ours and asked them to change their domain name server records - unfortunately their registrar was also Aloak - the worst domain registrar ever.

As I had previously experienced, the domain name changes that had been requested just weren't done.

We kept trying throughout July and August and now into September, and the domain name servers still haven't been changed. Every so often we get a response like this:

Today - we got this response:

Over 3 months to change some domain name records - and it still hasn't been done.

Hey CIRA - they're "CIRA Certified"? Can you guys do anything about this?

I would transfer the domain name - but the last time it took approximately 6 months.

Any ideas for my client?

Update: CIRA was able to help my client change their domain name records and transfer the domain. Thanks everybody!

TestKitchen, Dropbox and Growl - a remote build server

I've been working on a lot of Chef cookbooks lately. We've been upgrading some old ones, adding tests, integrating them with TestKitchen and generally making them a lot better.

As such, there have been a ton of integration test runs. Once you add a few test suites, a cookbook that tests 3 different platforms turns into a 9 VM run (3 suites x 3 platforms). While it doesn't take a lot of memory, it certainly takes a lot of horsepower and time to launch 9 VMs, converge them and then run the integration tests.

I have a few machines in my home office, and I've been on the lookout for more efficient ways to use them. Here's one great way to pretty effortlessly use a different (and possibly more powerful) machine to run your tests.

Why would you want to do this?

You may not always be working on your most powerful machine, or you may be doing other things on your local machine that you'd like to keep the horsepower for - so why not use an idle machine to run all of the tests for you?

What obstacles do we need to overcome?

  1. We need to get the files we're changing from one machine to another.
  2. We need to get that machine to automatically run the test suites.
  3. We need to get the results of those test suites back to the other machine.

What do you need?

  1. A cookbook to test using TestKitchen.
  2. Dropbox installed and working on both machines. (This helps with #1 above.)
  3. Growl installed on both machines. Make sure to enable forwarded notifications and enter passwords where needed. (This helps with #3 above.)
  4. Growlnotify installed on the build machine - it can also be installed with Homebrew: brew cask install growlnotify
  5. Guard and Growl gems - here's an example Gemfile. (This helps with #2 above.)
  6. A Guardfile with Growl notification enabled - here's an example Guardfile
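
If you'd rather not chase those links, a Gemfile and Guardfile along these lines will do the job - the gem list and watch patterns here are my best guess based on the output below, not the exact files linked above:

# Gemfile
source 'https://rubygems.org'

gem 'test-kitchen'
gem 'kitchen-vagrant'
gem 'guard'
gem 'guard-rubocop'
gem 'guard-foodcritic'
gem 'guard-rspec'
gem 'guard-kitchen'
gem 'growl'

# Guardfile
notification :growl

guard :rubocop do
  watch(%r{.+\.rb$})
end

guard :foodcritic, cookbook_paths: '.' do
  watch(%r{attributes/.+\.rb$})
  watch(%r{recipes/.+\.rb$})
  watch('metadata.rb')
end

guard :rspec, cmd: 'bundle exec rspec' do
  watch(%r{^spec/.+_spec\.rb$})
  watch(%r{^recipes/(.+)\.rb$}) { |m| "spec/unit/recipes/#{m[1]}_spec.rb" }
end

guard :kitchen do
  watch(%r{^recipes/.+\.rb$})
  watch(%r{^attributes/.+\.rb$})
  watch(%r{^templates/.+})
  watch(%r{^test/.+})
end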

How do you start?

On the build box:

Change to the directory where you have your cookbook and run guard.

This will start up Guard, which runs any lint/syntax tests, runs kitchen create for all of your integration suites and platform targets, and then waits for changes. Some sample output is below:

darron@: guard
11:33:24 - INFO - Guard is using Growl to send notifications.
11:33:24 - INFO - Inspecting Ruby code style of all files
Inspecting 16 files
................

16 files inspected, no offenses detected
11:33:25 - INFO - Linting all cookbooks

11:33:26 - INFO - Guard::RSpec is running
11:33:26 - INFO - Running all specs
Run options: exclude {:wip=>true}
.........

Finished in 0.58145 seconds (files took 2.07 seconds to load)
9 examples, 0 failures

11:33:30 - INFO - Guard::Kitchen is starting
-----> Starting Kitchen (v1.2.1)
-----> Creating <default-ubuntu-1004>...
       Bringing machine 'default' up with 'virtualbox' provider...
       ==> default: Importing base box 'chef-ubuntu-10.04'...
       ==> default: Matching MAC address for NAT networking...
       ==> default: Setting the name of the VM: default-ubuntu-1004_default_1408728822930
       ==> default: Clearing any previously set network interfaces...
       ==> default: Preparing network interfaces based on configuration...
           default: Adapter 1: nat
       ==> default: Forwarding ports...
           default: 22 => 2222 (adapter 1)
       ==> default: Booting VM...
       ==> default: Waiting for machine to boot. This may take a few minutes...
           default: SSH address: 127.0.0.1:2222
           default: SSH username: vagrant
# Lots of output snipped....
-----> Creating <crawler-ubuntu-1404>...
       Bringing machine 'default' up with 'virtualbox' provider...
       ==> default: Importing base box 'chef-ubuntu-14.04'...
       ==> default: Matching MAC address for NAT networking...
       ==> default: Setting the name of the VM: crawler-ubuntu-1404_default_1408729156367
       ==> default: Fixed port collision for 22 => 2222. Now on port 2207.
       ==> default: Clearing any previously set network interfaces...
       ==> default: Preparing network interfaces based on configuration...
           default: Adapter 1: nat
       ==> default: Forwarding ports...
           default: 22 => 2207 (adapter 1)
       ==> default: Booting VM...
       ==> default: Waiting for machine to boot. This may take a few minutes...
           default: SSH address: 127.0.0.1:2207
           default: SSH username: vagrant
           default: SSH auth method: private key
           default: Warning: Connection timeout. Retrying...
       ==> default: Machine booted and ready!
       ==> default: Checking for guest additions in VM...
       ==> default: Setting hostname...
       ==> default: Machine not provisioning because `--no-provision` is specified.
       Vagrant instance <crawler-ubuntu-1404> created.
       Finished creating <crawler-ubuntu-1404> (0m45.51s).
-----> Kitchen is finished. (6m16.96s)
11:39:48 - INFO - Guard is now watching at '~/test-cookbook'
[1] guard(main)>

All of these suites and their respective platforms are now ready:

darron@: kitchen list
Instance             Driver   Provisioner  Last Action
default-ubuntu-1004  Vagrant  ChefZero     Created
default-ubuntu-1204  Vagrant  ChefZero     Created
default-ubuntu-1404  Vagrant  ChefZero     Created
jenkins-ubuntu-1004  Vagrant  ChefZero     Created
jenkins-ubuntu-1204  Vagrant  ChefZero     Created
jenkins-ubuntu-1404  Vagrant  ChefZero     Created
crawler-ubuntu-1004  Vagrant  ChefZero     Created
crawler-ubuntu-1204  Vagrant  ChefZero     Created
crawler-ubuntu-1404  Vagrant  ChefZero     Created

On your development box:

Once kitchen create is complete - if you've set up Dropbox and Growl correctly - you should get a notification on your screen. Here are the notifications I received:

In my case, Guard ran some syntax/lint tests, Rspec tests, and then got all of the integration platforms and suites ready to go.

Let's get our tests to run automagically.

In your cookbook, make a change to your code and save it.

Very quickly (a couple of seconds in my case), Dropbox will send your file to the other machine, Guard will notice that a file has changed, and the tests will run automatically. If you're working on your integration tests, it will run kitchen converge and kitchen verify for each suite and platform combination.

Once that's complete, you should get a notification on your screen - this is what I see:

If you're working on some Chefspec tests, this may be what you'd see:

To sum it up - this allows you to:

  1. Develop on one machine.
  2. Run your builds on another.
  3. Get notifications when the builds are complete.
  4. Profit.

If you've got a spare machine lying around your office - maybe even an underutilized MacPro - give it a try!

Any questions? Any problems? Let me know!