Monitoring/Logs/Metrics Tools

I was getting a demo of a new tool/toolset from the guys at Stackdriver yesterday and we were talking about some of the monitoring/logs/metrics tools I was using. I listed a few, but after we finished the call I realized that I had actually forgotten a whole bunch that we use.

In the spirit of Thomas Fuchs' recent post, this post is an overview of some of the tools we use at nonfiction to monitor our servers:

  1. Pingdom - monitors our public web and DNS servers - the public status report is always available. Alerts via PagerDuty if it detects a problem.
  2. Munin - on an ancient RHEL box - about to be retired - graphed basic server metrics for years.
  3. Monit - deployed via Chef - restarts servers if they’re not responding - local to each server - notifies via email.
  4. Datadog - deployed via Chef - creates all sorts of server utilization graphs (like Munin) by default, also ties in various integrations to show your whole environment and how it works together. You can also throw various metrics into Datadog, and they take care of storing and visualizing those metrics. My current favorite tool because of the ease of use and integrations.
  5. Papertrail - log aggregator that pulls all of your server logs together in one place: syslog, Heroku, random log files, etc. You can also alert on specific log patterns using PagerDuty - very handy for so many reasons and worth every penny.
  6. Servicenarc - a way to make sure various cronjobs are running as often as they’re supposed to. Based on Dead Man’s Snitch.
  7. Boundary - pretty amazing network visualization tool to show your network flows in pretty much real time. I don’t use it as often as I should, but it’s pretty incredible when I do look at it.
  8. DenyHosts - watches for SSH password guessing and locks out IP addresses that are trying to break in.
  9. Logcheck - mails out “suspicious” log entries - was a great tool in the past, but has largely been replaced by Papertrail for us.
  10. Airbrake - our Rails apps all have this integrated for error detection and logging.
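
As a rough illustration of what "throwing a metric into Datadog" (item 4) can look like, here is a minimal sketch that sends a StatsD-style datagram over UDP. It assumes a Datadog agent is listening for StatsD traffic on localhost port 8125; the metric name `jobs.queue_depth` and the tag `env:staging` are made up for the example and are not anything we actually report.

```python
import socket


def format_metric(name, value, metric_type="g", tags=None):
    """Build a StatsD-style datagram such as 'jobs.queue_depth:42|g'.

    metric_type is 'g' for a gauge or 'c' for a counter. The '|#tag,tag'
    suffix is the Datadog tag extension to the plain StatsD format.
    """
    payload = f"{name}:{value}|{metric_type}"
    if tags:
        payload += "|#" + ",".join(tags)
    return payload


def send_metric(name, value, metric_type="g", tags=None,
                host="127.0.0.1", port=8125):
    """Send one metric over UDP (fire and forget) and return the payload.

    host and port are assumptions: a local agent on the usual StatsD port.
    """
    payload = format_metric(name, value, metric_type, tags)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))
    return payload


if __name__ == "__main__":
    # Hypothetical metric name and tag, for illustration only.
    send_metric("jobs.queue_depth", 42, tags=["env:staging"])
```

Because it is UDP, the call does not block or fail if nothing is listening, which is most of the appeal: the app never waits on the monitoring system.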

There are some tools I want to try out or take a closer look at:

  1. Sensu - looks promising.
  2. OSSEC - I had a basic installation running, but it was SOOOO chatty that I quickly ignored it - would like to see if I can get it to a reasonable balance of signal vs. noise.
  3. Stackdriver - looks interesting.
  4. Fail2ban - want to extend DenyHosts to FTP at least.
  5. logstash - turning logfiles into actionable data seems interesting.

Also some other notable tools I have looked at in the past but don’t use at the moment:

  1. New Relic - using this on one project but not overall
  2. Tracelytics - too much noise to be useful for us - may work better in other environments
  3. CopperEgg - liked it - worked pretty well for us
  4. Server Density - liked the iPhone app
  5. Splunk Storm - seemed super expensive but worked pretty well
  6. Loggly - worked great - just liked Papertrail better
  7. Scout - worked great
  8. ModSecurity - too much noise to be useful for us - may work better in other environments
  9. Librato Metrics - powerful tool with a great team behind it

What do you guys use? Anything notable that I’ve missed that I should look into?