Connecting Gitlab to Datadog using an Iron.io Worker

Wondering how to get commit notifications from Gitlab into Datadog?

There isn't an official integration from Datadog - but with a small Ruby app running on an Iron.io worker, you can create events in your Datadog Event stream when code is committed to your Gitlab repository.

You need a few things to make this happen:

  1. A free Iron.io account - sign up here.
  2. A Ruby environment (1.9 or newer) where you can install some gems.
  3. Access to your Gitlab repository Web Hooks page.
  4. An API key from your Datadog account - get it here.

Let's install the gems you'll need:

gem install iron_worker_ng

Now, create an Iron.io Project - I've called mine gitlab2datadog-demo.

After it's created, click on the "Worker" button:

Grab the iron.json credentials, so the gem knows how to upload your code:

Let's grab some code to create our worker:

git clone https://github.com/darron/iron_worker_examples.git
cd iron_worker_examples/ruby_ng/gitlab_to_datadog_webhook_worker/

Put the iron.json file into the iron_worker_examples/ruby_ng/gitlab_to_datadog_webhook_worker/ folder.

Now create a datadog.yml file in the same folder:

datadog:
  api_key: "not-my-real-key"

Using the gem we installed, upload your code to Iron.io:

iron_worker upload gitlab_to_datadog

Pay attention to the output from this command - it should look something like this:

------> Creating client
        Project 'gitlab2datadog-demo' with id='54d2dbd42f5b4f6544245355'
------> Creating code package
        Found workerfile with path='gitlab_to_datadog.worker'
        Merging file with path='datadog.yml' and dest=''
        Detected exec with path='gitlab_to_datadog.rb' and args='{}'
        Adding ruby gem dependency with name='dogapi' and version='>= 0'
        Code package name is 'gitlab_to_datadog'
------> Uploading and building code package 'gitlab_to_datadog'
        Remote building worker
        Code package uploaded with id='54d2dfc06485e3c433b04d431fd' and revision='1'
        Check 'https://hud.iron.io/tq/projects/54d2dbd42f5b4f000937965555/code/58d2dfc1675e3c433b04975d' for more info

Follow the link that's provided in the output - you should see the webhook URL:

Click the field to show the URL, copy that URL and then paste it into the Gitlab webhooks area:

Click "Add Web Hook" and if it works as planned - you'll have a "Test Hook" button to try out:

This is what I see in my Datadog Events stream now:

This is a simple way to get commit notifications into Datadog. The other types of web hooks aren't currently covered, but the code is simple enough to adjust.

Thanks to Iron.io for providing the original repository of examples.

I have a whole bunch of other projects in mind that could use Iron.io - glad I found this little project to try it out with!

How octohost uses Consul watches

I've been working on octohost lately, updating and upgrading some of the components. One of the things I've been looking for a chance to play with has been Consul watches - and I think I've found a great use for them.

As background, when you git push to an octohost, it builds a Docker container from the source code and the Dockerfile inside the repository. Once the container is built and ready to go, it does a few specific things:

  1. It grabs the configuration variables stored in Consul to start the proper number of containers.
  2. It registers the container as providing a Consul Service.
  3. It updates the nginx configuration files so that nginx can route traffic to the proper container.

Last week I updated octohost to improve those steps for a few reasons:

  1. If we changed any of the configuration variables, we had to manually restart the container before the changes would be picked up.
  2. If a container died unexpectedly, we weren't automatically updating the nginx configuration to reflect the actual state of the application.
  3. Our nginx configuration file was being built by a gem I created and wanted to retire in favor of Consul Template. The monolithic file it generated was very inflexible, and I wanted to make it easier to update.

For #1, when a site is pushed to octohost, I'm registering a "watch" for a specific location in Consul's Key Value space - octohost/$container-name. That kind of watch looks like this:

{
  "watches": [
    {
      "type": "keyprefix",
      "prefix": "octohost/html",
      "handler": "sudo /usr/bin/octo reload html"
    }
  ]
}

We're telling Consul to watch the octohost/html keys and anytime they change, to run sudo /usr/bin/octo reload html. As you can imagine, that reloads the container. Let's watch it in action:

Pretty nice eh? You can add keys or change values and the watch knows to run the handler to stop and start the container.
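For example, setting or updating any key under that prefix through Consul's KV HTTP API is enough to fire the handler - the key name and value here are just illustrative:

curl -X PUT -d 'production' http://localhost:8500/v1/kv/octohost/html/RACK_ENV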

NOTE: Before version 0.5, deleting a key doesn't do what you'd expect, but the Consul team knows about this and has posted a fix.

NOTE: This has been disabled because of this issue: hashicorp/consul/issues/571

For #2 and #3, we look at the Consul Service catalog we are populating here and register a different type of watch - a service watch. An example service watch looks like this:

{
  "watches": [
    {
      "type": "service",
      "service": "html",
      "handler": "sudo consul-template -config /etc/nginx/templates/html.cfg -once"
    }
  ]
}

We're telling Consul to watch the html service and, if its status changes, run the consul-template handler. This handler rebuilds the template we are using to tell nginx where to route container traffic. Let's watch that handler in action:

All of that was done by the Consul watch - it fires whenever it detects a change in the service catalog - I didn't have to do anything. I even killed a container at random, and it removed it from the configuration file auto-magically.
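To give a sense of what the handler renders, a Consul Template template for an nginx upstream block might look something like this - a simplified sketch, not octohost's actual template:

upstream html {
  {{range service "html"}}server {{.Address}}:{{.Port}};
  {{end}}
}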

Consul watches are pretty cool. If you're adding one to your Consul cluster, remember a few things:

  1. I've used separate JSON files for each watch. That works because we're telling Consul to look in an entire directory for configuration files via the -config-dir option.
  2. When you add a new file to the config-dir, you need to tell Consul to reload so it can read and activate it. If there's a syntax error or it can't load the watch, it notes that in the logs - so keep an eye on them when you're doing this.
  3. As of this moment and because it's brand new software, Consul Template can only do a single pass to populate the values - so the templates need to be pretty simple. We have worked around those limitations by doing our own first pass to pre-populate the values that are needed. Thanks to @bryanlarsen and @sethvargo, who discussed a workaround here: hashicorp/consul-template/issues/88

I think I've just scratched the surface with how to use Consul watches effectively and they have helped to simplify octohost. I'm looking forward to finding new and better uses for them.

NOTE: A special shout out to Armon, Seth, Mitchell and the rest of the crew at Hashicorp for some great software that can be twisted to further my plans for world domination.

Monitoring Apache Processes with Datadog

At nonfiction, we hosted the sites we built with a number of different hosting providers. The vast majority of the sites are on Rackspace Cloud instances - they have been very reliable for our workloads.

One of those servers had been acting up recently, becoming unresponsive for no obvious reason, so we took a closer look one morning after being woken up at 5AM.

Watching top for a moment, we noticed that some Apache processes were getting very large. Some of them were using between 500MB and 1GB of RAM - that's not within the normal usage patterns.

The first thing we did was set a reasonable limit on how large the Apache + mod_php processes could get - a memory_limit of 256MB. Since the Apache error logs are all aggregated with Papertrail, we set up an alert that sent a message to the nonfiction Slack room if any processes were killed. Those alerts look like this:

Once that was set up, we very quickly found that a customer on a legacy website had deleted some very important web pages - when those pages were missing, some very bad things could happen with the database. This had been mitigated in a subsequent software release, but their site hadn't been patched. Those pages were restored, the patch was applied, and the immediate problem was solved.

Keeping an eye on top, there were still websites using more memory than normal - at least more than we thought was normal. But sitting there watching was not a reasonable solution, so we whipped up a small script to send information to the dogstatsd agent already running on the machine.
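A minimal sketch of that kind of script - not the exact one we ran, and the metric name is just illustrative - grabs the resident set size of each Apache process from ps and sends it to the local dogstatsd listener as a gauge over UDP:

require 'socket'

statsd = UDPSocket.new

# RSS (in KB) for every Apache process, one dogstatsd gauge per process.
`ps -C apache2 -o pid=,rss=`.each_line do |line|
  pid, rss_kb = line.split.map(&:to_i)
  # dogstatsd wire format: metric:value|type|#tags
  statsd.send("apache.process.rss:#{rss_kb * 1024}|g|#pid:#{pid}", 0, '127.0.0.1', 8125)
end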

We were grabbing the memory size of the Apache processes and sending them to Datadog - the graphs that were generated from that data look like this:

Now we had a better - albeit fairly low resolution - window into how large the Apache processes were getting.

Over the last week, we have had a good amount of data to make some changes to how Apache is configured and then measure how it responds and reacts. Here's what the entire week's memory usage looked like:

Using Datadog's built-in process monitoring function and this graph, we gained some insight into how things were behaving overall, but not enough detail about exactly which sites were the memory hogs.

To close that gap, I wrote another small Ruby script - between ps and /server-status we had all the information we needed:
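Here's a rough sketch of the approach - not the exact script, which is linked at the end of the post. It assumes ExtendedStatus is on and the default /server-status HTML table layout on this server (PID in the 2nd column, VHost in the 12th - column positions vary between Apache versions):

require 'socket'
require 'net/http'

# Build a pid => RSS (bytes) map from ps.
rss_by_pid = {}
`ps -C apache2 -o pid=,rss=`.each_line do |line|
  pid, rss_kb = line.split.map(&:to_i)
  rss_by_pid[pid] = rss_kb * 1024
end

# Scrape /server-status to find out which vhost each worker PID is serving,
# then sum memory usage per vhost.
html = Net::HTTP.get(URI('http://localhost/server-status'))
rss_by_vhost = Hash.new(0)
html.scan(%r{<tr>(.*?)</tr>}m) do |(row)|
  cells = row.scan(%r{<td.*?>(.*?)</td>}m).flatten
  next if cells.length < 12
  pid   = cells[1].to_i
  vhost = cells[11].strip
  rss_by_vhost[vhost] += rss_by_pid.fetch(pid, 0)
end

# One gauge per site, sent to the local dogstatsd listener.
statsd = UDPSocket.new
rss_by_vhost.each do |vhost, bytes|
  statsd.send("apache.vhost.rss:#{bytes}|g|#vhost:#{vhost}", 0, '127.0.0.1', 8125)
end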

We can now see which sites are using the most memory in the heatmap, and the nonfiction team will be able to take a look at those sites and adjust as necessary. It's not a perfect solution, but it's a great way to get more visibility into exactly what's happening - and it only took a couple of hours in total.

What did we learn from all of this?

  1. Keeping MinSpareServers and MaxSpareServers relatively low can help to kill idle servers and reclaim their memory. We settled on 4 and 8 in the end (see the snippet after this list) - that helps keep overall memory usage much lower.

  2. A small change - a missing page in a corporate website - can have frustrating repercussions if you don't have visibility into exactly what's happening.

  3. The information you need to solve the problem is there - it just needs to be made visible and digestible. Throwing it into Datadog gave us the data we needed to surface targets for optimization and helped us to quickly stabilize the system.
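Here's what the prefork tuning from #1 looks like in the Apache configuration - just the two values we changed; the rest of the prefork settings are left at their existing values:

<IfModule mpm_prefork_module>
    MinSpareServers    4
    MaxSpareServers    8
</IfModule>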

All the source code for these graphs is available here and here. Give them a try if you need more insight into how your Apache is performing.

Full disclosure: I currently work as a Site Reliability Engineer on the TechOps team at Datadog - I was co-owner of nonfiction studios for 12 years.

Consul exec is a whole lot of fun.

I've been setting up a Consul cluster lately and am pretty excited about the possibilities with consul exec.

consul exec allows you to do things like this:

consul exec -node {node-name} chef-client

You can also target a service:

consul exec -service haproxy service haproxy restart
i-f6b46b1a:  * Restarting haproxy haproxy
i-f6b46b1a:    ...done.
i-f6b46b1a:
==> i-f6b46b1a: finished with exit code 0
i-24dae4c9:  * Restarting haproxy haproxy
i-24dae4c9:    ...done.
i-24dae4c9:
==> i-24dae4c9: finished with exit code 0
i-78f37694:  * Restarting haproxy haproxy
i-78f37694:    ...done.
i-78f37694:
==> i-78f37694: finished with exit code 0
3 / 3 node(s) completed / acknowledged

No ssh keys. No Capistrano timeouts. No static role and services mappings that may be out of date that very second. No muss and no fuss.

Serf - one of the technologies that underlies Consul - used to have the concept of a 'role'. We've been able to approximate these roles with Consul tags to get a similar effect.

To do that, we've added a generic service to each node in the cluster and have tagged the node with its chef roles and some other metadata:

{
  "service": {
    "name": "demoservice",
    "tags": [
      "backend",
      "role-base",
      "haproxy",
      "monitoring-client",
      "az:us-east-1c"
    ],
    "check": {
      "interval": "60s",
      "script": "/bin/true"
    }
  }
}

Now every node in the cluster, even ones that don't otherwise have a specific entry in the service catalog, can have commands run against it:

consul exec -service demoservice -tag az:us-east-1c {insert-command-here}

consul exec -service demoservice -tag haproxy {insert-command-here}

consul exec -service demoservice -tag backend {insert-command-here}

Each node runs a service check every 60 seconds - we chose something simple that will always report true.

consul exec is also really fast. Running w across a several-hundred-node cluster takes approximately 5 seconds with consul exec - running it with our legacy automation tool takes about 90 seconds in comparison.

I'm not sure if we're going to use it yet, but the possibilities with consul exec look pretty exciting to me.

NOTE: There are some legitimate security concerns with consul exec - right now it's pretty open - but they're looking at adding ACLs to it. It can also be completely disabled with disable_remote_exec - that may fit your risk profile until it's been tightened up a bit more.
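If you want to turn it off, that's a one-line addition to the agent's configuration:

{
  "disable_remote_exec": true
}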

Aloak is the worst domain registrar I have ever used.

TLDR: If you're having problems with a .ca domain name, reach out to CIRA - they may be able to help!

Late last year, I started to move 4 domain names off of Aloak - a registrar I had used for years. I had a few concerns:

  1. They weren't responsive to any request I had made in the last few years. I always had to ask and re-ask and continue to ask for small changes.

  2. Their web interface was abysmal and didn't work properly. I couldn't change items I needed to change.

  3. Their SSL certificate had actually expired in 2010.

After a couple of months, I had to abandon the effort - I stopped emailing them after nothing was done.

In May of 2014 I picked up the effort again, and in June it was finally done - we had enlisted DNSimple and their Concierge service to complete the transfer.

On July 9th I emailed a customer of ours and asked them to change their domain name server records - unfortunately their registrar was also Aloak - the worst domain registrar ever.

As I had previously experienced, the domain name changes that had been requested just weren't done.

We kept trying throughout July and August and now into September, and the domain name servers still haven't been changed. Every so often we get a response like this:

Today - we got this response:

Over 3 months to change some domain name records - and it still hasn't been done.

Hey CIRA - they're "CIRA Certified"? Can you guys do anything about this?

I would transfer the domain name - but the last time it took approximately 6 months.

Any ideas for my client?

Update: CIRA was able to help my client change their domain name records and transfer the domain. Thanks everybody!