Monitoring Apache Processes with Datadog

At nonfiction, we hosted the sites we built with a number of different hosting providers. The vast majority were on Rackspace Cloud instances, which have been very reliable for our workloads.

One of those servers had been acting up recently, becoming unresponsive for no obvious reason, so we took a quick look one morning after being woken up at 5AM.

Watching top for a moment, we noticed that some Apache processes were getting very large. Some were using between 500MB and 1GB of RAM - well outside the normal usage patterns.

The first thing we did was set a reasonable limit on how large the Apache + mod_php processes could get: a memory_limit of 256MB. Since the Apache error logs are all aggregated in Papertrail, we set up an alert that sent a message to the nonfiction Slack room whenever a process was killed. Those alerts look like this:

Once that was set up, we very quickly found that a customer on a legacy website had deleted some very important web pages; when those pages were missing, some very bad things could happen with the database. This had been mitigated in a subsequent software release, but that site hadn't been patched. The pages were restored and the site was patched. The problem was solved - at least the immediate problem.

Keeping an eye on top, there were still websites using more memory than normal - at least more than we thought was normal. But sitting there watching was not a reasonable solution, so we whipped up a small script to send information to the Datadog dogstatsd agent that was on the machine.
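The original script isn't reproduced here, but a minimal sketch of the idea might look like this. The metric name apache.process.rss and the apache2 process name are assumptions, and dogstatsd's plain-text UDP protocol (gauge packets on port 8125) is used directly so there are no gem dependencies:

```ruby
require 'socket'

# dogstatsd listens on localhost UDP port 8125 by default.
STATSD_HOST = '127.0.0.1'
STATSD_PORT = 8125

# Build a dogstatsd gauge packet: "name:value|g|#tag1,tag2"
def gauge_packet(name, value, tags = [])
  packet = "#{name}:#{value}|g"
  packet += "|##{tags.join(',')}" unless tags.empty?
  packet
end

# Parse `ps -C apache2 -o rss=` output into a list of RSS values in KB.
def apache_rss_kb(ps_output)
  ps_output.lines.map { |line| Integer(line.strip) }
end

# Shell out to ps and send one gauge per Apache worker.
# (Run this from cron every minute or so.)
def report_apache_memory!
  socket = UDPSocket.new
  apache_rss_kb(`ps -C apache2 -o rss=`).each do |rss|
    socket.send(gauge_packet('apache.process.rss', rss), 0, STATSD_HOST, STATSD_PORT)
  end
end
```

Because dogstatsd is fire-and-forget UDP, a cheap cron loop like this adds essentially no load to an already struggling server.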

We were grabbing the memory size of the Apache processes and sending it to Datadog - the graphs generated from that data look like this:

Now we had a better - albeit fairly low resolution - window into how large the Apache processes were getting.

Over the last week, we gathered enough data to make some changes to how Apache is configured and then measure how it responded. Here's what the entire week's memory usage looked like:

Using Datadog's built-in process monitoring function and this graph, we gained some insight into how things were behaving overall, but not enough detail about exactly which sites were the memory hogs.

In order to close that gap, I wrote another small Ruby script and between ps and /server-status we had all the information we needed:
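That script isn't included inline here, but the core join is simple: take RSS per PID from ps, take the PID-to-VHost mapping from mod_status, and sum by vhost. A rough sketch follows - the server-status URL and the scraping regex are assumptions about a typical mod_status page, not the actual script:

```ruby
require 'net/http'

# Sum per-worker RSS by virtual host, given:
#   rss_by_pid:   { pid => rss_kb }  (from ps)
#   vhost_by_pid: { pid => vhost }   (from /server-status)
def memory_by_vhost(rss_by_pid, vhost_by_pid)
  rss_by_pid.each_with_object(Hash.new(0)) do |(pid, rss), totals|
    vhost = vhost_by_pid[pid] or next # skip workers with no known vhost
    totals[vhost] += rss
  end
end

# pid => rss_kb for every Apache worker.
def rss_by_pid_from_ps
  `ps -C apache2 -o pid=,rss=`.lines.to_h do |line|
    pid, rss = line.split
    [pid, Integer(rss)]
  end
end

# Scrape PID and VHost out of the mod_status HTML table. The regex is a
# rough guess at the page's layout - adjust it for your own Apache.
def vhost_by_pid_from_status(url = 'http://localhost/server-status')
  html = Net::HTTP.get(URI(url))
  html.scan(%r{<td>(\d+)</td>.*?<td nowrap>([\w.-]+):\d+</td>}m).to_h
end

# Putting it together:
#   memory_by_vhost(rss_by_pid_from_ps, vhost_by_pid_from_status)
```

Note that mod_status only reports the vhost a worker most recently served, so the per-site numbers are approximate - good enough to spot the memory hogs, though.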

We can now see which sites are using the most memory in the heatmap, and the nonfiction team will be able to take a look at those sites and adjust as necessary. It's not a perfect solution, but it's a great way to get more visibility into exactly what's happening - and it only took a couple of hours in total.

What did we learn from all of this?

  1. Keeping MinSpareServers and MaxSpareServers relatively low can help to kill idle servers and reclaim their memory. We settled on 4 and 8 in the end - that helps to keep overall memory usage much lower.

  2. A small change - a missing page in a corporate website - can have frustrating repercussions if you don't have visibility into exactly what's happening.

  3. The information you need to solve the problem is there - it just needs to be made visible and digestible. Throwing it into Datadog gave us the data we needed to surface targets for optimization and helped us to quickly stabilize the system.
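For reference, the spare-server settings from lesson 1 would look something like this in the Apache config - a fragment only, not our full MPM configuration:

```apache
<IfModule mpm_prefork_module>
    # Keep the spare-server pool small so idle workers are
    # killed off and their memory is reclaimed.
    MinSpareServers 4
    MaxSpareServers 8
</IfModule>
```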

All the source code for these graphs is available here and here. Give them a try if you need more insight into how your Apache is performing.

Full disclosure: I currently work as a Site Reliability Engineer on the TechOps team at Datadog - I was co-owner of nonfiction studios for 12 years.

Consul exec is a whole lot of fun.

I've been setting up a Consul cluster lately and am pretty excited about the possibilities with consul exec.

consul exec allows you to do things like this:

consul exec -node {node-name} chef-client

You can also target a service:

consul exec -service haproxy service haproxy restart
i-f6b46b1a:  * Restarting haproxy haproxy
i-f6b46b1a:    ...done.
==> i-f6b46b1a: finished with exit code 0
i-24dae4c9:  * Restarting haproxy haproxy
i-24dae4c9:    ...done.
==> i-24dae4c9: finished with exit code 0
i-78f37694:  * Restarting haproxy haproxy
i-78f37694:    ...done.
==> i-78f37694: finished with exit code 0
3 / 3 node(s) completed / acknowledged

No ssh keys. No Capistrano timeouts. No static role and services mappings that may be out of date that very second. No muss and no fuss.

Serf - one of the technologies that underlies Consul - used to have the concept of a 'role'. We've been able to approximate these roles with Consul tags to get a similar effect.

To do that, we've added a generic service to each node in the cluster and have tagged the node with its chef roles and some other metadata:

  "service": {
    "name": "demoservice",
    "tags": ["az:us-east-1c", "haproxy", "backend"],
    "check": {
      "interval": "60s",
      "script": "/bin/true"
    }
  }
Now each node in the cluster, even ones that don't have a specific entry in the service catalog, has the ability to have commands run against it:

consul exec -service demoservice -tag az:us-east-1c {insert-command-here}

consul exec -service demoservice -tag haproxy {insert-command-here}

consul exec -service demoservice -tag backend {insert-command-here}

Each node runs a service check every 60 seconds - we chose something simple that will always report true.

consul exec is also really fast. Running w across a several-hundred-node cluster takes approximately 5 seconds with consul exec - running it with our legacy automation tool takes about 90 seconds in comparison.

I'm not sure if we're going to use it yet, but the possibilities with consul exec look pretty exciting to me.

NOTE: There are some legitimate security concerns with consul exec - right now it's pretty open - but they're looking at adding ACLs to it. It can also be completely disabled with disable_remote_exec - that may fit your risk profile until it's been tightened up a bit more.
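If remote execution doesn't fit your risk profile, turning it off is a one-line agent configuration setting - disable_remote_exec comes straight from the note above; the surrounding JSON file is just a sketch of an agent config:

```json
{
  "disable_remote_exec": true
}
```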

Aloak is the worst domain registrar I have ever used.

TLDR: If you're having problems with a .ca domain name, reach out to CIRA - they may be able to help!

Late last year, I started to move 4 domain names off of Aloak - a registrar I had used for years. I was concerned because:

  1. They weren't responsive to any request I had made in the last few years. I always had to ask and re-ask and continue to ask for small changes.

  2. Their web interface was abysmal and didn't work properly. I couldn't change items I needed to change.

  3. Their SSL certificate had actually expired in 2010.

After a couple of months, I had to abandon the effort - I stopped emailing them after nothing was done.

In May of 2014 I picked up the effort again and enlisted DNSimple and their Concierge service - in June, the transfer was finally complete.

On July 9th I emailed a customer of ours and asked them to change their domain name server records - unfortunately their registrar was also Aloak - the worst domain registrar ever.

As I had previously experienced, the domain name changes that had been requested just weren't done.

We kept trying all throughout July, August and now into September, and the domain name servers still haven't been changed. Every so often we get a response like this:

Today - we got this response:

Over 3 months to change some domain name records - and it still hasn't been done.

Hey CIRA - they're "CIRA Certified"? Can you guys do anything about this?

I would transfer the domain name - but the last time it took approximately 6 months.

Any ideas for my client?

Update: CIRA was able to help my client change their domain name records and transfer the domain. Thanks everybody!

TestKitchen, Dropbox and Growl - a remote build server

I've been working on a lot of Chef cookbooks lately. We've been upgrading some old ones, adding tests, integrating them with TestKitchen and generally making them a lot better.

As such, there have been a ton of integration test runs. Once you add a few test suites, a cookbook that tests 3 different platforms turns into a 9-VM run. While it doesn't take a lot of memory, it certainly takes a lot of horsepower and time to launch 9 VMs, converge them and then run the integration tests.

I have a few machines in my home office, and I've been on the lookout for more efficient ways to use them. Here's one great way to pretty effortlessly use a different (and possibly more powerful) machine to run your tests.

Why would you want to do this?

You may not always be working on your most powerful machine, or you may be doing other things on your local machine that you'd like to keep that horsepower for - so why not use an idle machine to run the tests for you?

What obstacles do we need to overcome?

  1. We need to get the files we're changing from one machine to another.
  2. We need to get that machine to automatically run the test suites.
  3. We need to get the results of those test suites back to the other machine.

What do you need?

  1. A cookbook to test using TestKitchen.
  2. Dropbox installed and working on both machines. (This helps with #1 above.)
  3. Growl installed on both machines. Make sure to enable forwarded notifications and enter passwords where needed. (This helps with #3 above.)
  4. Growlnotify installed on the build machine - it can be installed with Homebrew: brew cask install growlnotify
  5. Guard and Growl gems - here's an example Gemfile. (This helps with #2 above.)
  6. A Guardfile with Growl notification enabled - here's an example Guardfile
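The linked Guardfile isn't reproduced here, but a minimal one built on guard-kitchen looks something like this - the watch patterns are typical for a cookbook and are assumptions, not the exact file:

```ruby
# Guardfile
notification :growl

guard 'kitchen' do
  watch(%r{^recipes/.+\.rb$})
  watch(%r{^attributes/.+\.rb$})
  watch(%r{^templates/.+})
  watch(%r{^files/.+})
end
```

The `notification :growl` line is what makes the results show up on your development box once Growl forwarding is configured.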

How do you start?

On the build box:

Change to the directory where you have your cookbook and run guard.

This will start up Guard, run any lint/syntax tests, run kitchen create for all of your integration suites and platform targets, and get ready to watch for changes. Some sample output is below:

darron@: guard
11:33:24 - INFO - Guard is using Growl to send notifications.
11:33:24 - INFO - Inspecting Ruby code style of all files
Inspecting 16 files

16 files inspected, no offenses detected
11:33:25 - INFO - Linting all cookbooks

11:33:26 - INFO - Guard::RSpec is running
11:33:26 - INFO - Running all specs
Run options: exclude {:wip=>true}

Finished in 0.58145 seconds (files took 2.07 seconds to load)
9 examples, 0 failures

11:33:30 - INFO - Guard::Kitchen is starting
-----> Starting Kitchen (v1.2.1)
-----> Creating <default-ubuntu-1004>...
       Bringing machine 'default' up with 'virtualbox' provider...
       ==> default: Importing base box 'chef-ubuntu-10.04'...
       ==> default: Matching MAC address for NAT networking...
       ==> default: Setting the name of the VM: default-ubuntu-1004_default_1408728822930
       ==> default: Clearing any previously set network interfaces...
       ==> default: Preparing network interfaces based on configuration...
           default: Adapter 1: nat
       ==> default: Forwarding ports...
           default: 22 => 2222 (adapter 1)
       ==> default: Booting VM...
       ==> default: Waiting for machine to boot. This may take a few minutes...
           default: SSH address:
           default: SSH username: vagrant
# Lots of output snipped....
-----> Creating <crawler-ubuntu-1404>...
       Bringing machine 'default' up with 'virtualbox' provider...
       ==> default: Importing base box 'chef-ubuntu-14.04'...
       ==> default: Matching MAC address for NAT networking...
       ==> default: Setting the name of the VM: crawler-ubuntu-1404_default_1408729156367
       ==> default: Fixed port collision for 22 => 2222. Now on port 2207.
       ==> default: Clearing any previously set network interfaces...
       ==> default: Preparing network interfaces based on configuration...
           default: Adapter 1: nat
       ==> default: Forwarding ports...
           default: 22 => 2207 (adapter 1)
       ==> default: Booting VM...
       ==> default: Waiting for machine to boot. This may take a few minutes...
           default: SSH address:
           default: SSH username: vagrant
           default: SSH auth method: private key
           default: Warning: Connection timeout. Retrying...
       ==> default: Machine booted and ready!
       ==> default: Checking for guest additions in VM...
       ==> default: Setting hostname...
       ==> default: Machine not provisioning because `--no-provision` is specified.
       Vagrant instance <crawler-ubuntu-1404> created.
       Finished creating <crawler-ubuntu-1404> (0m45.51s).
-----> Kitchen is finished. (6m16.96s)
11:39:48 - INFO - Guard is now watching at '~/test-cookbook'
[1] guard(main)>

All of these suites and their respective platforms are now ready:

darron@: kitchen list
Instance             Driver   Provisioner  Last Action
default-ubuntu-1004  Vagrant  ChefZero     Created
default-ubuntu-1204  Vagrant  ChefZero     Created
default-ubuntu-1404  Vagrant  ChefZero     Created
jenkins-ubuntu-1004  Vagrant  ChefZero     Created
jenkins-ubuntu-1204  Vagrant  ChefZero     Created
jenkins-ubuntu-1404  Vagrant  ChefZero     Created
crawler-ubuntu-1004  Vagrant  ChefZero     Created
crawler-ubuntu-1204  Vagrant  ChefZero     Created
crawler-ubuntu-1404  Vagrant  ChefZero     Created

On your development box:

Once kitchen create is complete - if you've set up Dropbox and Growl correctly - you should get a notification on your screen. Here are the notifications I received:

In my case, Guard ran some syntax/lint tests, Rspec tests, and then got all of the integration platforms and suites ready to go.

Let's get our tests to run automagically.

In your cookbook, make a change to your code and save it.

Very quickly (a couple of seconds in my case), Dropbox will send your file to the other machine, Guard will notice that a file has changed and will run the tests automatically. If you're working on your integration tests, it will run a kitchen converge and kitchen verify for each suite and platform combination.

Once that's complete, you should get a notification on your screen - this is what I see:

If you're working on some Chefspec tests, this may be what you'd see:

To sum it up - this allows you to:

  1. Develop on one machine.
  2. Run your builds on another.
  3. Get notifications when the builds are complete.
  4. Profit.

If you've got a spare machine lying around your office - maybe even an underutilized MacPro - give it a try!

Any questions? Any problems? Let me know!

The recent octohost changes - where we're headed.

Late last year, octohost was created as a system to host websites:

  1. With no or minimal levels of manual intervention.
  2. With very little regard to underlying technology or framework.
  3. As a personal mini-PaaS modeled after Heroku with a git push interface to deploy these sites.
  4. Using disposable, immutable and rebuildable containers of source code.

What have we found?

  1. Docker is an incredible tool that takes containers on Linux to the next level.
  2. If you keep your containers simple and ruthlessly purge unnecessary features, they can run uninterrupted for long periods of time.
  3. Having the ability to install anything in a disposable container is awesome.
  4. You can utilize your server resources much more efficiently using containers to host individual websites.

As we've been using it, we've also been thinking about ways to make it better:

  1. How can we make it faster?
  2. How can we make it simpler and more reliable?
  3. How big can we make it? How many sites can we put on a single server?
  4. How can we combine multiple octohosts together as a distributed cluster that's bigger and more fault-tolerant than a single one?
  5. How can we run the same container on different octohosts for fault-tolerance and additional scalability for a particular website?
  6. How can we persist configuration data beyond the lifecycle of the disposable container?
  7. How can we distribute and make this configuration data available around the system?
  8. How can we integrate remote data stores so that we can still keep the system itself relatively disposable?
  9. How can we trace an HTTP request through the entire chain from the proxy, to container and back?
  10. How can we lower the barrier to entry so that it can be built/spun up easier?

A number of these have been 'accomplished', but we've also made a number of large changes to help enable the next phases of octohost's lifecycle.

  1. We replaced the Hipache proxy with Openresty which immediately sped everything up and allowed us to use Lua to extend the proxy's capabilities.
  2. We moved from etcd to Consul to store and distribute our persistent configuration data. That change allowed us to make use of Consul's Services and Health Check features.
  3. We removed the tentacles container, which used Ruby, Sinatra and Redis to store a website's endpoint. Due to how it was hooked up to nginx, it was queried on every hit so that nginx knew which endpoint to route the request to. The data model was also limited to a single endpoint and required a number of moving parts. I like fewer moving parts - removing it was a win in many ways.
  4. We refactored the octo command and the gitreceive script, which enabled the launching of multiple containers for a single site.
  5. We added a configuration flag to use a private registry, so that an image only has to be built once and can be pulled onto other members of the cluster quickly and easily.
  6. We added a plugin architecture for the octo command, and the first plugin was for MySQL user and database creation.
  7. We replaced tentacles with the octoconfig gem that pulls the Service and configuration data out of Consul and writes an nginx config file. The gem should be extensible enough that we can re-use it for other daemons as needed.

So what are we working on going forward?

  1. Getting octohost clustered easily and reliably. At a small enough size and workload, each system should be able to proxy for any container in the cluster.
  2. Working on the movement, co-ordination and duplication of containers from octohost to octohost.
  3. Improving the consistency and efficiency of octohost's current set of base images. We will be starting from Ubuntu 14.04LTS and rebuilding from there.
  4. Continuing to improve the traceability of HTTP requests through the proxy, to the container and back.
  5. Improving the performance wherever bottlenecks are found.
  6. Improving the documentation and setup process.

What are some pain points that you've found? What do you think of our plans?

Send any comments to Darron or hit us up on Twitter.