25 Jan 2016
As discussed in my Consul service discovery talk at Scale14x on Saturday, we've figured out a technique that uses Consul's KV store to move configuration files around - and it has worked surprisingly well.
We released kvexpress - a small tool that:
- Uploads data into Consul's KV store and prepares it for distribution - usually on a single node.
- Downloads that data from Consul's KV store onto a client node, verifies it, writes it to a file and then runs an optional handler.
This happens usually in one of two main ways:
- Kicked off from a Consul watch - this makes the delivery process very quick. It takes a little more effort to set up, but after that it's pretty hands off.
- In an ad-hoc manner - for when you need to put something on a bunch of nodes quickly.
Here's a quick demo of how it works using the Consul watch. It shows how removing a node from Consul's service catalog updates a hosts file that's inserted and delivered by kvexpress:
We can see a few things from the graphs:
- The files on all 1188 nodes are updated quite quickly - most of them under 300 milliseconds.
- There's one node that takes between 4 and 5 seconds consistently - I think it's an overloaded logging node.
The insertion happens when Consul Template notices the bunk service is disabled and rebuilds the template - Consul Template then hands off the final rendered template to kvexpress for insertion.
After the file is inserted, it replicates through Consul's KV store, and the Consul watches on the key kvexpress/hosts/checksum notice a change - which kicks off the kvexpress out process that double checks the file, writes the new file and reloads dnsmasq.
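That double check is essentially a checksum comparison before anything touches disk. Here's a minimal sketch of the idea in Python - assuming a SHA-256 checksum; the function is mine, not part of kvexpress:

```python
import hashlib

def verify(data: bytes, expected_checksum: str) -> bool:
    """Return True if the SHA-256 of data matches the checksum stored in Consul."""
    return hashlib.sha256(data).hexdigest() == expected_checksum

# The watch fires when kvexpress/hosts/checksum changes; before writing
# /etc/hosts.consul, the downloaded data is verified against that checksum.
hosts = b"10.0.0.1 web-01\n10.0.0.2 web-02\n"
checksum = hashlib.sha256(hosts).hexdigest()
print(verify(hosts, checksum))                 # True - matching data passes
print(verify(hosts + b"tampered", checksum))   # False - altered data is rejected
```

If the check fails, the safe thing to do is leave the existing file alone rather than write a corrupt one.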
An example Consul watch handler would look like this:

```
"handler": "kvexpress out -k hosts -f /etc/hosts.consul -l 10 -c 00644 -e 'sudo pkill -HUP dnsmasq'"
```
All of the commands we used - with example versions of each - are located here.
Here's another quick demo of how it works in ad-hoc mode.
In this demo, I am going to show:
- Grabbing a URL from a gist - it will be a 600 line configuration file.
- Installing that config file on 1200 nodes.
- In the same action, I will be removing the file - but normally you would restart the daemon or HUP a process instead.
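Under the hood, an insert boils down to writing the data plus its checksum into Consul's KV store, which the watches then fan out to every node. A rough sketch of the key layout implied by the checksum key earlier - note that the exact paths kvexpress uses are my assumption, not its documented layout:

```python
import hashlib

def kv_payloads(key: str, data: bytes) -> dict:
    """Build the KV paths for a kvexpress-style insert: the raw data plus a
    checksum that clients verify before writing the file to disk.
    The 'kvexpress/<key>/...' layout mirrors the watch key above, but the
    exact paths are an assumption, not kvexpress's documented layout."""
    return {
        f"kvexpress/{key}/data": data.decode(),
        f"kvexpress/{key}/checksum": hashlib.sha256(data).hexdigest(),
    }

config = b"# 600-line config file fetched from a gist\n"
# Each path/value pair would be PUT to Consul's KV HTTP API at /v1/kv/<path>;
# the watch on the checksum key then triggers kvexpress out on every node.
for path in kv_payloads("config", config):
    print(path)
```

Because the checksum key only changes after the data key is written, clients never act on a half-finished upload.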
kvexpress can help you use Consul's KV store to make very quick changes to your cluster's configuration with safety and precision. There's additional kvexpress-specific information in Saturday's talk - it starts at 44:30 in the video and at slide 83 in the slides.
05 Jan 2016
80 days ago, I decided that I would put real effort into learning to program in Go.
I had been working on something I had written in Ruby - from the original Bash script that it replaced - so I knew the problem space very well, and I had my first potential project. As I finished the Ruby version, I realized that even though it was "correct", I had overlooked part of the problem space and needed to extend it if I truly wanted a comprehensive solution.
I didn't want to re-architect the Ruby version - I also didn't want to deal with adding the gems and a Ruby 2.x runtime to 1000 machines - so I thought I'd take a quick spike to see how quickly I could write it in Go. I'd written code in several languages with similar syntax - how hard could it be?
I created a private repo in my own account on GitHub and started hacking on Thursday night. After a cross-country flight on Friday and some free time over the weekend, I had binaries that were close to the same level of functionality as the Ruby version. I was very excited.
Some background might help here. During my almost 2 decades working with computers, I had worked with all sorts of different technologies and written code in many different languages - but I am not a developer. I'm much closer to a sysadmin / ops guy and I don't have any formal CS training. I studied theology and philosophy at school but the web ended up being my true calling.
When trying out a new programming language, I would sometimes buy a book, start reading and then try to "do it the right way". Gotta have tests! And those tests need to be mocked properly so that you can test without network access. And you need to make sure to write it in the style that the language is known for.
Nope - not this time - at least not at first.
I've half-learned all sorts of technology that way - gotten overwhelmed with details that never quite came together - so I was going to do this a little differently. I was not going to get stuck and give up.
Please don't misunderstand me - it's not that tests aren't valuable and that "doing things the right way" isn't a laudable goal. But I wasn't about to derail learning this tool because I couldn't put out perfect, tested and modular code right away. I will get there - but I need to read and write lots of code first.
I found some libraries to use and was going to start building with a couple of pieces of reference material. I bought a couple of books - but neither was actually available at the time - one of them isn't even done yet.
I looked through some Go intros and got the basics - but better than that, I started to write code, because I learn by doing.
And the code compiled, came together and worked. It was understandable and could easily be reasoned about. It was simple and organized into logical chunks and it functioned! The binary was significantly smaller and less cumbersome than my Ruby version, especially with all of its dependencies. I was able to refactor quite easily and so I did when it made sense.
I was pretty excited - this was fun again - but I was also freaked out about when I would actually have to show it to other people. I work with some of the smartest people on the planet and I knew:
- I write code, but I am not really a developer.
- I didn't use some of the distinctive features of Go because I hadn't needed to yet. As somebody who reviewed my code early on said - this was more like C code written in Go.
- There was obvious refactoring that I could see - but what about the things I couldn't see yet? How many of those would I miss?
- There were no tests (yet). I didn't want to fall down that rabbit hole and not be able to climb out.
- We had talked internally about releasing my first Go project as an open source tool after my talk in January - scary.
That fear of failure of "not doing it the right way" had blocked me in the past but I was not going to let it stop me this time.
I needed to push past that fear of failure - that fear of not looking like I knew everything - because I needed to learn. I needed to go back to the beginning and be the student. How else do you learn? How else do you grow? I needed to not care about what Internet randos think about my coding style - or lack thereof. I need to be free of that as a concern in general.
I have no illusions that my code is the fastest, the best or the shortest. But I don't really care right now. I'm going to continue to learn, continue to get better and understand more - but I'm not ashamed of where I am at this very moment.
Because 80 short days ago, I had just picked up a new set of tools.
80 days later, we've deployed 3 of my creations into production where they perform their duty quite well.
80 days later, my newest project is being built with unit and integration tests from the start.
And I'm looking forward to the next 80 days of growing, learning and getting better at my craft.
I have a lifetime to learn new things - and I'm just getting started with Go.
Push through the fear - leave it behind - it's worth it.
29 Nov 2015
A little while ago, one of my oldest friends spent several months refining and launching EasyRedir, a URL redirection service he created to help solve some problems he was seeing.
He wanted a simple, easy to use service for managing URL and domain redirects, but most of the ones he saw were anything but - so, as is his custom - he created a really good tool and is offering it as a service.
In the past, I've gone about doing this sort of thing by building my own little tools, using mod_rewrite on Apache or updating web server configuration files, but I no longer have the patience for this - it's generally not an effective use of my own personal time.
William built the EasyRedir web application with a friendly Rails frontend and a custom Lua-powered nginx backend - all hosted on AWS in a well-built and scalable fashion.
I generally like to use the best tool for the particular job - rather than get tied into a single provider for everything - it gives me much more flexibility going forward.
If you've got some vanity domains or need some redirects and don't want to worry about it - give EasyRedir a look. It's a great company, with solid technology under the hood that's focused on the problem.
Don't buy the big package of services from "insert-barely-capable-telecom-company-that-gives-you-free-hosting" - then you're locked in to their terrible services and it's a real pain to untangle it all later.
PS - Here are some of the small focused tools that I personally like to use:
- EasyRedir - for URL and domain redirects - they even have a free tier.
- DNSimple - for domain names and DNS hosting. I also have a couple of domains hosted at DNS Made Easy - but will probably move them over when I have a moment.
- Packagecloud - to host my apt repos, which are often built automatically by Wercker.
- Papertrail - for log aggregation and tail-as-it-comes-in capability.
26 Aug 2015
I had the privilege to present today at Devopsdays Chicago. I condensed a proposed 30-minute talk down to 20 slides in an Ignite format. There's way more I could say about Consul - but 5 minutes is just not enough time.
Below the slides, I've placed the transcript of what I had planned to say - hopefully the YouTube upload will be posted shortly.
Slides at Speakerdeck
Video on YouTube
I don’t know about many of you, but before I started working at Datadog, even though I loved monitoring, I monitored my boxes at the END of a project. Build the thing. Test the thing. Put it into prod. Oh yeah - monitor some of the things - and write the docs.
I would set up my tools, grab some metrics, get some pretty graphs, and I thought that I was doing the right thing. I thought that since I had satisfied “Monitor all the things”, I had succeeded. Check.
Nope. In fact I had failed. And I deserved a wicked dragon kick from a devops ninja. I had missed one of the most crucial times to monitor my service - monitor it from scratch - think about monitoring it when the packages are still in the repo.
HOW ELSE DO YOU KNOW THAT IT’S DOING THE THING? EVEN WHEN YOU’RE BUILDING IT - IT MAY NOT BE DOING THE THING YOU THINK IT’S DOING. Monitoring last is a fail - you should be monitoring first.
Plan to monitor it before there’s even data. Especially if it’s big data. For example, we decided we needed to prototype something that could help us. We had hundreds of VMs, and 30 minute Chef runs were just too long to change a feature toggle.
We looked at a few options and Consul looked like it had the components we were after: Small Go binary, DNS and HTTP interface for service discovery, Key Value storage engine with simple clients, integrated Failure Detection - we were excited.
But we were also a bit afraid of it - this is a new tool. How much memory would it take? Would it interfere with other processes? Would it be destabilizing to our clusters and impact production? There were many unknown unknowns – the things we didn’t know that we didn’t know yet.
So we started as many of us do - by fixing staging. We read the docs, Chef’d up some recipes, seasoned to taste and got a baby cluster running. Now what should we monitor?
On the Consul server side, we started with a few standard metrics: Overall Average networking, networking / server and cpu / server. Great - we’ve replicated Munin - my work is done here.
We wondered: would the agent use up all of our precious memory, drive the OOM killer crazy and stop our processes? Nope. It didn't do that either. None of our worries materialized - most likely because we weren't really using anything.
But, as we worked with Consul, broke and fixed the cluster, we quickly found the 2 metrics that were the most important: 1. Do we have a leader? 2. Has there been a leadership transition event?
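The leader question is easy to poll: Consul's HTTP API exposes /v1/status/leader, which returns the leader's address as a JSON-encoded string - or an empty string when the cluster has no leader. A minimal sketch of that check (the function name is mine; the endpoint and response format are Consul's):

```python
import json

def has_leader(raw_response: str) -> bool:
    """Parse the body of GET /v1/status/leader - a JSON-encoded string
    like "10.0.0.1:8300", or "" when the cluster has no leader."""
    return json.loads(raw_response) != ""

# Polling a live agent on the default port would look like (not run here):
#   raw = urllib.request.urlopen("http://127.0.0.1:8500/v1/status/leader").read().decode()
print(has_leader('"10.0.0.1:8300"'))  # True - healthy cluster
print(has_leader('""'))               # False - no leader, time to alert
```

Counting how often the answer flips over time gives you the second metric - leadership transition events.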
After a couple weeks of exploration and watching it idle without taking down the world, we thought: Staging isn’t prod. Let’s see how the cluster behaves with more nodes. It's probably fine.
Hmm. That’s not good. Sure looks like a lot of leadership transitions. Definitely more than in staging. How about we add a couple more server nodes for the increased cluster size?
Ah yes - 5 seems about right. Now that it’s all calmed down and we’re feeling lucky, I heard there’s this really cool app to try. Let’s run it on all our nodes and get up to date configuration files that reload our services on the fly.
My bad. OK - maybe that wasn’t such a good idea. Too many things querying for too many other things at once. Maybe building a file on every machine at once isn’t the right thing to do. There’s got to be a better way.
Ahh OK - that's much better. Let’s build it on a single node and distribute it through the integrated Key/Value store. No more unending leadership transitions. No more scary graphs. Much wow.
Because we monitored first, we can experiment and see the impact of our choices before they become next year’s technical debt. Because we monitored first we can make decisions with hard data rather than with just our gut feels.
Because we monitored first, when we ran into strange pauses - we could collect additional metrics and discover - individual nodes aren’t going deaf - the server is - and that’s affecting groups of nodes.
So please - monitor first - not last. Make sure that the thing you’re building is doing what you think it’s doing before it’s too late and you have to do a 270 away from certain peril.
Monitor first - you just never know what Shia might do if you don't.