Monitor First

I had the privilege of presenting today at Devopsdays Chicago. I condensed a proposed 30-minute talk down to 20 slides in an Ignite format. There are far more things I could say about Consul - but 5 minutes is just not enough time.

Below the slides, I've placed the transcript of what I had planned to say - hopefully the YouTube upload will be posted shortly.

Slides at Speakerdeck

Video on YouTube

  1. I don’t know about many of you, but before I started working at Datadog, even though I loved monitoring, I monitored my boxes at the END of a project. Build the thing. Test the thing. Put it into prod. Oh yeah - monitor some of the things - and write the docs.

  2. I would set up my tools, grab some metrics, get some pretty graphs, and I thought that I was doing the right thing. I thought that since I had satisfied “Monitor all the things” I had succeeded. Check.

  3. Nope. In fact I had failed. And I deserved a wicked dragon kick from a devops ninja. I had missed one of the most crucial times to monitor my service - monitor it from scratch - think about monitoring it when the packages are still in the repo.


  5. Plan to monitor it before there’s even data. Especially if it’s big data. For example, we decided we needed to prototype something that could help us. We had hundreds of VMs and 30-minute Chef runs were just too long to change a feature toggle.

  6. We looked at a few options and Consul looked like it had the components we were after: Small Go binary, DNS and HTTP interface for service discovery, Key Value storage engine with simple clients, integrated Failure Detection - we were excited.

  7. But we were also a bit afraid of it - this is a new tool. How much memory would it take? Would it interfere with other processes? Would it be destabilizing to our clusters and impact production? There were many unknown unknowns – the things we didn’t know that we didn’t know yet.

  8. So we started as many of us do - by fixing staging. We read the docs, Chef’d up some recipes, seasoned to taste and got a baby cluster running. Now what should we monitor?

  9. On the Consul server side, we started with a few standard metrics: overall average network traffic, network traffic per server and CPU per server. Great - we’ve replicated Munin - my work is done here.

  10. We wondered if the agent would use up all of our precious memory, drive the OOM killer crazy and stop our processes? Nope. It didn’t do that either. None of our worries materialized - most likely because we weren’t really using anything.

  11. But, as we worked with Consul, broke and fixed the cluster, we quickly found the 2 metrics that were the most important: 1. Do we have a leader? 2. Has there been a leadership transition event?

  12. After a couple weeks of exploration and watching it idle without taking down the world, we thought: Staging isn’t prod. Let’s see how the cluster behaves with more nodes. It's probably fine.

  13. Hmm. That’s not good. Sure looks like a lot of leadership transitions. Definitely more than in staging. How about we add a couple more server nodes for the increased cluster size?

  14. Ah yes - 5 seems about right. Now that it’s all calmed down and we’re feeling lucky, I heard there’s this really cool app to try. Let’s run it on all our nodes and get up to date configuration files that reload our services on the fly.

  15. My bad. OK - maybe that wasn’t such a good idea. Too many things querying for too many other things at once. Maybe building a file on every machine at once isn’t the right thing to do. There’s got to be a better way.

  16. Ahh OK - that's much better. Let’s build it on a single node and distribute it through the integrated Key/Value store. No more unending leadership transitions. No more scary graphs. Much wow.

  17. Because we monitored first, we can experiment and see the impact of our choices before they become next year’s technical debt. Because we monitored first we can make decisions with hard data rather than with just our gut feels.

  18. Because we monitored first, when we ran into strange pauses - we could collect additional metrics and discover - individual nodes aren’t going deaf - the server is - and that’s affecting groups of nodes.

  19. So please - monitor first - not last. Make sure that the thing you’re building is doing what you think it’s doing before it’s too late and you have to do a 270 away from certain peril.

  20. Monitor first - you just never know what Shia might do if you don’t.

Using Amazon Auto Scaling Groups with Packer and Chef

Using Amazon Auto Scaling Groups with Packer-built custom Amazon Machine Images and a Chef server can help your infrastructure respond better to changing conditions, but there are a lot of moving parts that need to be connected in order for it to work properly.

I have never seen all the pieces documented in a single place, so I'm documenting them here for posterity and explanation.

There are 3 main phases in the lifecycle that we need to plan for:

  1. Build Phase - preparing the AMI to be used.
  2. Running Phase - connecting the AMI to the Chef server on launch and scaling up as needed.
  3. Teardown Phase - after an instance is removed, its node and client data need to be deleted from the Chef server.

During the build phase you will need to set up and configure these items:

  1. An Amazon user with permissions to launch an instance with a particular IAM Profile and Security Group.
  2. Packer needs to have access to the Chef server validator key so that it can create a new node and client in the Chef server. This can be done using the IAM Profile or you may have the key available to Packer locally.
  3. Packer needs a configuration file that builds the AMI - a minimal example that uses EBS volumes to store the AMI is sketched after this list.
  4. The Amazon user needs to be able to save the resulting AMI to your Amazon account. Those AMIs are stored as either an EBS volume (the simplest method) or are uploaded into an S3 bucket.
  5. Auto Scaling is picky about matching the instance type you built on with the one you'll be running on. It's easiest to build and run on the same type. If you're just manually running instances or not using Auto Scaling then you can usually mix types without trouble.
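
If you haven't written one before, a minimal Packer template for this setup looks roughly like the sketch below - the region, AMI, instance type, profile, security group and Chef server details are all placeholders for your own values:

    {
      "builders": [{
        "type": "amazon-ebs",
        "region": "us-east-1",
        "source_ami": "ami-0xdeadb33f",
        "instance_type": "m3.medium",
        "ssh_username": "ubuntu",
        "iam_instance_profile": "packer-build",
        "security_group_id": "sg-0123abcd",
        "ami_name": "haproxy-{{timestamp}}"
      }],
      "provisioners": [{
        "type": "chef-client",
        "server_url": "https://chef.example.com/organizations/example",
        "validation_client_name": "example-validator",
        "validation_key_path": "keys/example-validator.pem",
        "run_list": ["role[haproxy]"]
      }]
    }

The chef-client provisioner registers a temporary node and client against your Chef server during the build, which is why the validator key needs to be available.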

Packer actually takes care of automating most of this - but there are lots of things going on. At the end of the build, it's critically important to remember:

  1. The Chef client and node from the AMI you just built needs to be removed from the Chef server. (Packer does this for you.)
  2. You need to make sure to remove the Chef client.pem, client.rb, validation.pem and first-boot.json files - they will need to be re-created when the instance boots again (sketched below).
  3. Some other software may have saved state you want to remove - for example - we disable the Datadog Agent and remove all Consul server state.
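
A final shell provisioner in the Packer template can take care of steps 2 and 3. A rough sketch, assuming the default /etc/chef layout (the Datadog and Consul lines are just examples of clearing extra state):

    #!/bin/bash
    # Remove Chef state so the next boot registers as a fresh node.
    sudo rm -f /etc/chef/client.pem /etc/chef/validation.pem \
               /etc/chef/client.rb /etc/chef/first-boot.json
    # Clear any other saved state - adjust for your own services.
    sudo service datadog-agent stop || true
    sudo rm -rf /var/lib/consul/*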

During the Running phase, when you're actually using the AMI you built, you will need to set up and configure:

  1. A Launch Configuration which details the AMI, instance type, keys, IAM Profile and Security Groups - among other things.
  2. An Auto Scaling Group which uses the Launch Configuration we just created and adds desired capacity, availability zones, auto-scaling cooldowns and some user-data. The primary goal of the user-data is to re-connect the new instance to the Chef server so that provisioning can complete - an example user-data script is sketched after this list.
  3. In order to scale your group up, you'll need to create a Scaling Policy that details how you will be scaling the group.
  4. A Cloudwatch Metric Alarm tells your Scaling Policy when to enact the change.
  5. To scale your group down, you need to create another Scaling Policy that tells the group how to accomplish that.
  6. A final Metric Alarm details the conditions that will tell your Down Scaling Policy when to remove instances.
  7. When any of these events happen, you should be notified. Amazon SNS is a great service that can notify you when that occurs. We are sending the notifications to an Amazon SQS queue so that any instances that are scaled down can be easily removed from the Chef server.
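
The user-data itself is just a script that cloud-init runs on first boot. A rough sketch of what it needs to do - the Chef server URL, validator name and run list below are placeholders:

    #!/bin/bash
    # Re-register this instance with the Chef server on first boot.
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

    cat > /etc/chef/client.rb <<EOF
    chef_server_url        "https://chef.example.com/organizations/example"
    validation_client_name "example-validator"
    node_name              "${INSTANCE_ID}"
    EOF

    echo '{"run_list":["role[haproxy]"]}' > /etc/chef/first-boot.json

    # validation.pem has to be retrievable here - via the IAM Profile or baked into the AMI.
    chef-client -j /etc/chef/first-boot.json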

All of these items can be configured using the Amazon Management Console or EC2 API tools. The Management Console is easy to use - but the API tools can be automated so that you don't have to spend as much time doing it:

~@: ./ ami-0xdeadb33f staging
Creating 'haproxy' ASG with ami-0xdeadb33f for staging.
Creating Launch Configuration: Success
Creating Auto Scaling Group: Success
Creating Scale Up Policy: Success
Creating Up Metric Alarm: Success
Creating Scale Down Policy: Success
Creating Down Metric Alarm: Success
Creating SNS Notification: Success
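
That script is just a thin wrapper around a handful of API calls. With the aws CLI, the equivalent calls look roughly like this - the names, sizes and thresholds are examples, and the scale-down policy and alarm are symmetric:

    aws autoscaling create-launch-configuration \
      --launch-configuration-name haproxy-staging-lc \
      --image-id ami-0xdeadb33f --instance-type m3.medium \
      --iam-instance-profile haproxy --security-groups sg-0123abcd \
      --user-data file://user-data.sh

    aws autoscaling create-auto-scaling-group \
      --auto-scaling-group-name haproxy-staging \
      --launch-configuration-name haproxy-staging-lc \
      --min-size 2 --max-size 6 --desired-capacity 2 \
      --availability-zones us-east-1a us-east-1b

    aws autoscaling put-scaling-policy \
      --auto-scaling-group-name haproxy-staging \
      --policy-name haproxy-scale-up \
      --adjustment-type ChangeInCapacity --scaling-adjustment 2

    aws cloudwatch put-metric-alarm \
      --alarm-name haproxy-staging-cpu-high \
      --namespace AWS/EC2 --metric-name CPUUtilization \
      --statistic Average --period 300 --evaluation-periods 2 \
      --threshold 70 --comparison-operator GreaterThanThreshold \
      --dimensions Name=AutoScalingGroupName,Value=haproxy-staging \
      --alarm-actions <scale-up-policy-arn>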

Once you've created all of those items, the Auto Scaling Group will have automatically started up and should be serving your traffic.

You can easily test your scaling policy - I use stress to trigger the Metric Alarm: apt-get install -y stress && stress -c 2

Stressing the CPUs can trigger an Auto Scaling event, which adds the number of servers you have chosen to your group. After they've been added and the cooldown you specified earlier has passed, you can stop stressing the servers and they should scale back down.

At this point, you need to make sure that you're ready to deal with the third teardown phase.

During the teardown phase you need to:

  1. Deal with any state you need to keep from the auto-scaled servers. This can be complicated to deal with and is beyond the scope of this blog post. We are using Auto Scaling Groups with stateless servers that can be discarded at any time.
  2. Remove the node and client data from the Chef server. We are using a modified version of this script which runs in a Docker container. That script needs AWS and Chef access to accomplish the client and node deletion - the core of it is sketched below.
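
The deletion itself boils down to two knife commands, assuming knife has admin credentials and your Chef node names match the EC2 instance IDs reported in the termination notification:

    # INSTANCE_ID comes from the parsed SNS/SQS termination message.
    knife node delete "${INSTANCE_ID}" -y
    knife client delete "${INSTANCE_ID}" -y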

Hopefully that helps give some clues and examples of how to accomplish this for your own infrastructure. If you're starting from scratch, it might be simplest to master a single phase at a time before you move ahead to the next.

Please let me know if you've got any questions or would like something clarified.

Connecting Gitlab to Datadog using an IronWorker

Wondering how to get commit notifications from Gitlab into Datadog?

There isn't an official integration from Datadog - but with a small Ruby app running as an IronWorker, you can create events in your Datadog Event stream when code is committed to your Gitlab repository.

You need a few things to make this happen:

  1. A free Iron.io account - sign up here.
  2. A Ruby environment > 1.9 where you can install some gems.
  3. Access to your Gitlab repository Web Hooks page.
  4. An API key from your Datadog account - get it here.

Let's install the gems you'll need:

gem install iron_worker_ng

Now, create an Iron.io project - I've called mine gitlab2datadog-demo.

After it's created, click on the "Worker" button:

Grab the iron.json credentials, so the gem knows how to upload your code:

Let's grab some code to create our worker:

git clone
cd iron_worker_examples/ruby_ng/gitlab_to_datadog_webhook_worker/

Put the iron.json file into the iron_worker_examples/ruby_ng/gitlab_to_datadog_webhook_worker/ folder.

Now create a datadog.yml file in the same folder:

        api_key: "not-my-real-key"
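
For reference, the core of such a worker is only a few lines. This is a rough sketch rather than the actual gitlab_to_datadog.rb from the repository, and it assumes the Gitlab push payload has already been written out as JSON for the worker to read:

    # Turn the commits in a Gitlab push payload into Datadog events.
    require 'yaml'
    require 'json'
    require 'dogapi'

    api_key = YAML.load_file('datadog.yml')['api_key']
    dog     = Dogapi::Client.new(api_key)

    # Stand-in for the webhook body that the IronWorker runtime delivers as the task payload.
    push = JSON.parse(File.read('payload.json'))

    push['commits'].each do |commit|
      title = "#{commit['author']['name']} pushed to #{push['repository']['name']}"
      dog.emit_event(Dogapi::Event.new(commit['message'], msg_title: title,
                                       source_type_name: 'gitlab'))
    end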

Using the gem we installed, upload your code to IronWorker:

iron_worker upload gitlab_to_datadog

Pay attention to the output from this command - it should look something like this:

------> Creating client
        Project 'gitlab2datadog-demo' with id='54d2dbd42f5b4f6544245355'
------> Creating code package
        Found workerfile with path='gitlab_to_datadog.worker'
        Merging file with path='datadog.yml' and dest=''
        Detected exec with path='gitlab_to_datadog.rb' and args='{}'
        Adding ruby gem dependency with name='dogapi' and version='>= 0'
        Code package name is 'gitlab_to_datadog'
------> Uploading and building code package 'gitlab_to_datadog'
        Remote building worker
        Code package uploaded with id='54d2dfc06485e3c433b04d431fd' and revision='1'
        Check '' for more info

Follow the link that's provided in the output - you should see the webhook URL:

Click the field to show the URL, copy that URL and then paste it into the Gitlab webhooks area:

Click "Add Web Hook" and if it works as planned - you'll have a "Test Hook" button to try out:

This is what I see in my Datadog Events stream now:

This is a simple way to get commit notifications into Datadog. The other types of web hooks aren't currently covered, but the code is simple enough to be adjusted.

Thanks to Iron.io for providing the original repository of examples.

I have a whole bunch of other projects in mind that could use IronWorker - glad I found this little project to try it out with!

How octohost uses Consul watches

I've been working on octohost lately, updating and upgrading some of the components. One of the things I've been looking for a chance to play with has been Consul watches - and I think I've found a great use for them.

As background, when you git push to an octohost, it builds a Docker container from the source code and the Dockerfile inside the repository. Once the container is built and ready to go, it does a few specific things:

  1. It grabs the configuration variables stored in Consul to start the proper number of containers.
  2. It registers the container as providing a Consul Service.
  3. It updates the nginx configuration files so that nginx can route traffic to the proper container.

Last week I updated octohost to improve those steps for a few reasons:

  1. If we changed any of the configuration variables, we had to manually restart the container before it would be picked up.
  2. If a container dies unexpectedly, we weren't automatically updating the nginx configuration to reflect the actual state of the application.
  3. Our nginx configuration file was being built by a gem I created and wanted to retire in favour of Consul Template. The monolithic file it generated was very inflexible and I wanted to make it easier to update.

For #1, when a site is pushed to octohost, I'm registering a "watch" for a specific location in Consul's Key Value space - octohost/$container-name. That kind of watch looks like this example:

  "watches": [
    "type": "keyprefix",
    "prefix": "octohost/html",
    "handler": "sudo /usr/bin/octo reload html"

We're telling Consul to watch the octohost/html keys and anytime they change, to run sudo /usr/bin/octo reload html. As you can imagine, that reloads the container. Let's watch it in action:
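
For example, writing any key under that prefix through the HTTP API is enough to fire the handler (WORKERS is just a made-up key for illustration):

    curl -X PUT -d '4' http://localhost:8500/v1/kv/octohost/html/WORKERS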

Pretty nice eh? You can add keys or change values and the watch knows to run the handler to stop and start the container.

NOTE: Before version 0.5, deleting a key doesn't do what you'd expect, but the Consul team knows about this and has posted a fix.

NOTE: This has been disabled because of this issue: hashicorp/consul/issues/571

For #2 and #3, we look at the Consul Service catalog we are populating here and register a different type of watch - a service watch. An example service watch looks like this:

  "watches": [
    "type": "service",
    "service": "html",
    "handler": "sudo consul-template -config /etc/nginx/templates/html.cfg -once"

We're telling Consul to watch the html service and if the status changes, run the consul-template handler. This handler rebuilds the template we are using to tell nginx where to route container traffic. Let's watch that handler in action:
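
To give a flavour of what that handler renders - this isn't octohost's actual template - a consul-template upstream block for the html service looks something like this:

    upstream html {
      {{range service "html"}}server {{.Address}}:{{.Port}};
      {{end}}
    }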

All of that was done by the Consul watch - it fires whenever it detects a change in the service catalog - I didn't have to do anything. I even killed a container at random, and it removed it from the configuration file auto-magically.

Consul watches are pretty cool. If you're adding one to your Consul cluster, remember a few things:

  1. I've used separate JSON files for each watch. That works because we're telling Consul to load configuration files from an entire directory with the -config-dir option.
  2. When you add a new file to the config-dir, you need to tell Consul to reload so it can read and activate it (see the command below). If there's a syntax error or it can't load the watch, it notes that in the logs - so keep an eye on them when you're doing this.
  3. As of this moment, and because it's brand new software, Consul Template can only do a single pass to populate the values - so the templates need to be pretty simple. We have worked around those limitations by doing our own first pass to pre-populate values that are needed. Thanks to @bryanlarsen and @sethvargo who discussed a workaround here: hashicorp/consul-template/issues/88
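
Reloading is a single command against the local agent (sending the Consul process a SIGHUP does the same thing):

    consul reload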

I think I've just scratched the surface with how to use Consul watches effectively and they have helped to simplify octohost. I'm looking forward to finding new and better uses for them.

NOTE: A special shout out to Armon, Seth, Mitchell and the rest of the crew at Hashicorp for some great software that can be twisted to further my plans for world domination.

Monitoring Apache Processes with Datadog

At nonfiction, we hosted the sites we built with a number of different hosting providers. The vast majority of the sites are hosted on Rackspace Cloud instances, which have been very reliable for our workloads.

One of those servers had been acting up recently, becoming unresponsive for no obvious reason, so we took a quick look one morning after being woken up at 5 AM.

Watching top for a moment, we noticed that some Apache processes were getting very large. Some of them were using between 500MB and 1GB of RAM - that's not within the normal usage patterns.

The first thing we did was set a reasonable limit on how large the Apache + mod_php processes could get - memory_limit = 256M. Since the Apache error logs are all aggregated with Papertrail, we set up an alert that sent a message to the nonfiction Slack room if any processes were killed. Those alerts look like this:

Once that was set up, we very quickly found that a customer on a legacy website had deleted some very important web pages - when those pages were missing, some very bad things could happen with the database. This had been mitigated in a subsequent software release, but that site hadn't been patched. The pages were restored, the site was patched, and the problem was solved - at least the immediate problem.

Keeping an eye on top, we could see there were still websites using more memory than normal - at least more than we thought was normal. But sitting there watching wasn't a reasonable solution, so we whipped up a small script to send information to Datadog's DogStatsD agent running on the machine.
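
The script doesn't need to be fancy. Here's a simplified sketch rather than our exact script - the metric names are just examples - that shells out to ps for the resident set size of each Apache process and writes straight to the DogStatsD UDP port:

    #!/usr/bin/env ruby
    # Send Apache worker memory to the local DogStatsD agent (UDP 8125).
    require 'socket'

    statsd = UDPSocket.new

    # RSS in kilobytes for every Apache worker; the process is 'httpd' on some distros.
    rss_kb = `ps -C apache2 -o rss=`.split.map(&:to_i)

    statsd.send("apache.process.count:#{rss_kb.size}|g", 0, '127.0.0.1', 8125)
    statsd.send("apache.process.total_rss:#{rss_kb.inject(0, :+) * 1024}|g", 0, '127.0.0.1', 8125)
    rss_kb.each do |kb|
      statsd.send("apache.process.rss:#{kb * 1024}|h", 0, '127.0.0.1', 8125)
    end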

We were grabbing the memory size of the Apache processes and sending them to Datadog - the graphs that were generated from that data look like this:

Now we had a better - albeit fairly low resolution - window into how large the Apache processes were getting.

Over the last week, we have had a good amount of data to make some changes to how Apache is configured and then measure how it responds and reacts. Here's what the entire week's memory usage looked like:

Using Datadog's built-in process monitoring function and this graph, we gained some insight into how things were acting overall, but not enough detail about exactly which sites were the memory hogs.

In order to close that gap, I wrote another small Ruby script and between ps and /server-status we had all the information we needed:

We can now see which sites are using the most memory in the heatmap, and the nonfiction team will be able to take a look at those sites and adjust as necessary. It's not a perfect solution, but it's a great way to get more visibility into exactly what's happening - and it only took a couple of hours in total.

What did we learn from all of this?

  1. Keeping MinSpareServers and MaxSpareServers relatively low can help to kill idle servers and reclaim their memory. We settled on 4 and 8 in the end - that helps to keep overall memory usage much lower.

  2. A small change - a missing page in a corporate website - can have frustrating repercussions if you don't have visibility into exactly what's happening.

  3. The information you need to solve the problem is there - it just needs to be made visible and digestible. Throwing it into Datadog gave us the data we needed to surface targets for optimization and helped us to quickly stabilize the system.

All the source code for these graphs is available here and here. Give them a try if you need more insight into how your Apache is performing.

Full disclosure: I currently work as a Site Reliability Engineer on the TechOps team at Datadog - I was co-owner of nonfiction studios for 12 years.