Chef is a distributed system configuration management tool. In plainer English, Chef is a system of computer programs designed to let a person or organization manage the setup and configuration of lots and lots of computers in an automated, repeatable and recoverable fashion. It is akin to Bcfg2, CFEngine and Puppet in this regard.

What's it for?

As the use of computers in enterprises shifted, maybe fifteen years ago, from larger machines to large farms of 'standard' microcomputers, a new problem arose. Under the mainframe and minicomputer model, setup and configuration of the computer itself was a rare event, done at initial installation and then again only if the computer suffered a horrific setback - and those machines were engineered so that this almost never happened. When it did, operations generally ceased until the machine could be recovered. Organizations that couldn't tolerate that kept a second computer that could handle the load until the main one was repaired - but that was enormously expensive.

When farms of smaller machines started to come into use, the trouble multiplied. These new machines also needed setup and configuration - indeed, sometimes more of it than a mainframe with its few configurable options. Worse, there were suddenly a whole lot of computers needing to be set up and managed - and when new versions of operating systems came out (which started to occur more and more frequently), the workload of applying those changes skyrocketed as well.

Just to make things worse, one of the whole selling points of the smaller machines was that if one or two died at any point, it wouldn't really affect a well-designed infrastructure. Other machines would take up the load while new or standby boxes were quickly configured to handle the tasks the failed ones had dropped. This, of course, meant more setup and configuration load, because smaller computers do fail. All the time. As the drive continued for the machines to become cheaper and more generic, their components became consumer grade and failure became expected rather than avoided. It was cheaper that way.

So, fifteen years later, here we are. System administrators (Ops) have to spend a great deal of time configuring machines, patching machines, and so forth. Then things got even worse. Virtualization hit the mainstream, and a 'computer', rather than being something you ordered, unboxed and stuck in a rack, became something you got a la carte with the click of a button. You could 'create' dozens, hundreds or thousands of computers in no time at all to handle failures or load increases. You could destroy them in even less time. And all those computers needed to be configured!

Enter configuration management. If computers can now be reduced to code - the code required to 'spawn' new instances from whatever virtualization vendor you're using - then it makes sense for their management to be reduced to code as well. This is what Chef does: it allows Ops to write code which describes the desired state of systems, and then works on its own to apply those configuration states to running machines. Chef can itself spawn virtual machines, if desired, and then configure them all the way up to and including deploying code and starting the server. So with one click, or completely automatically, a properly set up Chef system can spawn a virtual server (on Amazon EC2, a VMware cluster, Rackspace, or other providers), wait for it to boot, bootstrap its own code onto the system, apply any necessary operating system updates, install any required software, deploy current versions of application code, start the server process, and insert the new server into a load-balanced cluster. All without human intervention.
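To give a taste of what that one click looks like, here's a sketch of an era-appropriate command using the knife-ec2 plugin - the AMI ID, instance flavor and role name are all placeholders, and the exact flags depend on your plugin version:

```sh
# Spawn an EC2 instance, wait for it to boot, bootstrap the Chef client
# onto it, and hand it the 'webserver' role as its initial run list.
knife ec2 server create -I ami-12345678 -f m1.small -r 'role[webserver]'
```

One command, and some minutes later a fully configured member of the cluster exists where nothing did before.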

What does it do?

Chef has two modes: chef-server and chef-solo. In solo mode, Chef operates purely as a complex scripting system on a single machine, and configures that machine as requested. The advantage is that you don't need any infrastructure to make this work; the drawback is that you're forced to load your entire configuration script set onto each and every server. If you change your desired config, you then have to update the scripts on every server running chef-solo, and there's no easy way to determine the state of your total infrastructure.
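A chef-solo invocation of this era looks something like the following - the filenames are just conventional examples, where the config file points at your local cookbook path and the JSON file names the runlist:

```sh
# Run chef-solo against a local config and a JSON file of node attributes
chef-solo -c solo.rb -j node.json
```

Everything it needs must already be sitting on the machine's disk - which is exactly the limitation described above.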

In server mode, a Chef server (basically a message queue, a NoSQL database and a web server glued together with some Chef code) keeps track of all the configuration information for your entire infrastructure. Whenever a new computer is spawned (or, for that matter, unpacked, racked, stacked and turned on), that computer is induced to first contact the Chef server and declare what role it is meant to be playing. With that information, the Chef server constructs on the fly a list of configuration actions for that server to carry out (a runlist) and sends it to the server, which then executes the runlist in sequence. Chef runs are intended to be idempotent - meaning you should be able to run the same runlist over and over again, and the state of the server should converge on the desired state. If the server is already in the desired state, executing the runlist should do nothing.
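To make 'idempotent' concrete, here is a toy sketch in plain Ruby (not actual Chef code) of a resource that checks the current state before acting, so that a second run converges to a no-op:

```ruby
require 'tmpdir'

# Toy 'file resource': act only if the system isn't already in the desired
# state. Real Chef resources follow the same test-then-repair pattern.
def file_resource(path, content)
  return :up_to_date if File.exist?(path) && File.read(path) == content
  File.write(path, content)
  :updated
end

Dir.mktmpdir do |dir|
  conf = File.join(dir, 'app.conf')
  puts file_resource(conf, "port 8080\n")  # first run repairs: updated
  puts file_resource(conf, "port 8080\n")  # second run is a no-op: up_to_date
end
```

Run the same 'runlist' a hundred times and the hundredth run does exactly nothing - which is the point.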

The advantage here is that if something goes wrong, the machine is not then broken and in need of human assistance. So long as what went wrong has not damaged the basic OS configuration (and it's hard to do that by mistake), the next time the client runs, Chef will pick up where it left off and try to fix the mistake by trying once more to make things look 'like they're supposed to.' For example, if during a run a server is told to retrieve some software from another server or an internet URL, and that address isn't reachable due to a temporary network glitch, the Chef run will complete without that software being installed. The missing software might prevent other parts of the runlist from executing properly. But the next time the Chef client runs on the server (and the usual practice is to have it run in daemon mode, checking with the Chef server and performing a Chef run every few minutes or hours), then if the network address is working, the server will successfully pull down its required software and pick up where it left off.

This is incredibly powerful. It means that if you are using fully virtual compute instances, your entire infrastructure suddenly becomes code rather than hardware (at least from your point of view; somebody has to make sure the iron is able to provide you with those virtual instances, but that's not your problem). If your software is smart enough, it can monitor itself for load - and if the load gets too high, it can simply ask the Chef system to spawn new servers to help handle it. If the load drops, it can just destroy some servers that are idle. And when those servers come up, they can be fully managed by pre-written configuration code. No humans necessary during the run; humans only necessary to design the system and maintain the infrastructure architecture, not to maintain individual configurations.

What does it look like?

Chef is an open-source software system whose maintenance is managed by a company named Opscode, which was (AFAICT) set up for that purpose. Opscode also has a managed platform, which is basically a big Chef server set up as a SaaS offering. This lets you pay them a monthly fee rather than have to worry about managing Chef servers yourself. Or, of course, you can set up your own server using their code without paying them a cent.

Chef is written in Ruby, and is steeped in that language. Chef configuration instructions come in units called "cookbooks" (ha). The name jokes abound; the command-line management tool is called "knife", and individual scripts are called "recipes" (oh ho, hoo, ha.) These cookbooks contain recipes as their base logic, but also contain templating resources, files that may be installed on the running systems, code snippets to extend Chef's logic, and all manner of variables. Perhaps the most powerful bit of Chef is that variables (called attributes, in Chef parlance) are fully namespaced. The top namespace is the Chef server. In other words, it is possible and easy to set attributes on the server which any other node can see and use and/or reset if necessary. These attributes can be namespaced by cookbook name, by node (a 'node' is a computer being managed by Chef), or in any other namespace desired.

For example, if all your nodes are going to install the Apache webserver, it is possible to create an apache namespace which contains all the variable values necessary to install Apache. Let's say you've decided you want each of your nodes to keep at most ten spare Apache worker processes idling at any time. That setting is handled by the Apache config file. You would place a template in the apache cookbook which contains the proper line to set that number - but instead of hard-coding a value in the line, like this:

MaxSpareServers 10
...you would use a template file. Chef's templates are written using Embedded Ruby (ERB), so you would end up with something like the following:
MaxSpareServers <%= node[:apache][:maxspareservers] %>
Then, on the Chef server, you would set the attribute "maxspareservers", in the "apache" namespace, to 10 or whatever. When Chef runs on the node, it will write out the template file, and replace that strange looking sequence with the value that it gets by evaluating the expression as Ruby - in this case, the number '10.'
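You can see the substitution mechanism with nothing but Ruby's standard ERB library - the node hash here is a stand-in for the real attribute data a Chef run hands to its templates:

```ruby
require 'erb'

# Stand-in for the attribute data Chef supplies to templates during a run
node = { :apache => { :maxspareservers => 10 } }

template = 'MaxSpareServers <%= node[:apache][:maxspareservers] %>'
puts ERB.new(template).result(binding)  # MaxSpareServers 10
```

Chef does essentially this for every template resource, then writes the rendered result out to the node's filesystem.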

The namespacing's power becomes apparent when you start using roles on the server. Roles, you see, let you override attributes according to a complex but powerful precedence system. So let's say the server that has come up has told the Chef server that it wants to be registered with the role 'webserver'. On the Chef server, 'webserver' has a list of cookbooks/recipes associated with it, to be run on any machine that tells the server it is a 'webserver'. Well and good. However, roles can also have attributes set! So let's say that while we install Apache on a whole bunch of machines, only the webservers need to have as many as 10 server processes running, whereas the others might only need 1 or 2 in order to report maintenance information. In the apache cookbook's defaults, we could set the value of [:apache][:maxspareservers] to 2. But in the role information for 'webserver' we could set that same attribute - [:apache][:maxspareservers] - to 10. So when a computer registered as a webserver asks the Chef server for the value of that attribute, it will get told 10.
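That precedence behavior can be sketched in a few lines of plain Ruby - a deliberate simplification, since real Chef has several more precedence levels than the two shown here:

```ruby
# Role-level overrides win over cookbook-level defaults; a recursive merge
# makes the later hash win wherever both define a value.
def deep_merge(base, over)
  base.merge(over) do |_key, a, b|
    a.is_a?(Hash) && b.is_a?(Hash) ? deep_merge(a, b) : b
  end
end

cookbook_defaults = { 'apache' => { 'maxspareservers' => 2, 'port' => 80 } }
role_overrides    = { 'apache' => { 'maxspareservers' => 10 } }

node = deep_merge(cookbook_defaults, role_overrides)
puts node['apache']['maxspareservers']  # 10 (role override wins)
puts node['apache']['port']             # 80 (cookbook default survives)
```

The webserver gets its 10; every attribute the role doesn't mention falls through to the cookbook's default.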

This, of course, is only the smallest taste of the kinds of run-time customization that Chef is capable of. The point, though, is this: once your organization has decided on a 'standard' way of installing Apache, rather than writing a manual so that ops can 'follow the book', or writing a script which they then have to upload and run on every machine and which doesn't take into account what else is on the box, you can write a Chef 'apache' cookbook which does exactly what you want. Then you can call that cookbook from any role which you've decided needs to have Apache installed on it. If you're really cooking with Chef, you can write the cookbook to handle installing Apache on many different types of systems - it can be made smart enough to know how to install Apache on Windows as well as on Ubuntu Linux, BSD and Mac OS X. Then, later, any time you want Apache on a server, you just tell your role or your cookbook to call the 'apache' cookbook, pass it only those bits of information which are Apache-specific - MaxSpareServers and the like - and need not care a whit about what type of system it's being installed on.
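A platform-aware recipe fragment might look something like this - written in Chef's recipe DSL, so it runs inside a Chef client rather than as standalone Ruby, and the platform names and fallback package are illustrative:

```ruby
# Pick the platform's name for the Apache package, then install and start it.
package_name = case node[:platform]
               when 'ubuntu', 'debian' then 'apache2'
               when 'centos', 'redhat' then 'httpd'
               else 'apache22'   # placeholder fallback for other platforms
               end

package package_name do
  action :install
end

service package_name do
  action [:enable, :start]
end
```

The caller never sees any of this; it just asks for Apache and lets the cookbook sort out the platform details.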

This, friends, is power.

This sounds awesome! Is it?

Yes. It is. There are, of course, gotchas. Getting an infrastructure up and running with Chef involves a lot of front-loaded work. If someone comes in and says "WE NEED A SERVER UP IN HALF AN HOUR THAT DOES THING X", you may not have time to properly write and test a Chef cookbook set or role that will do that. On the other hand, once you get server X working the old-fashioned dirty way, it might behoove you to go back and write the Chef code to do it automatically - and the next time someone asks you for one, you can click a button and say "Done."

Chef is a work in progress, and there are areas where it is deficient. In my opinion, this is nowhere more apparent than in error handling. At present, if any step of any part of the runlist fails (the process exits with a return code other than 0), the entire Chef run fails. It doesn't exit and return error information to the server - it just dies, usually spewing huge chunks of Ruby backtrace information. What this means, in practice, is that writing a Chef cookbook or other resource means doing all your debugging up front, because if you push a recipe with an error in it live to your infrastructure, every machine that runs it will probably stop working properly - every time the client runs, it will error out and stop, never finishing any steps it has yet to do.

Chef is Ruby-centric. This is good and bad. It's good because Ruby is a standard programming language; you're not asked to learn a Chef-specific scripting language. It's good because Ruby is fairly powerful, and because this lets you access the deep guts of the Chef system itself at need in a standard and comprehensible manner. It's bad because Ruby has several characteristics that I for one am deeply suspicious of. For example, one mantra of Ruby is 'trust the programmer.' I don't trust programmers; I'm an Op. Ruby will let the user (in this case, the Chef cookbook writer) shoot themselves in the foot in any number of ways. It also will let you do the same task any number of different ways; like Perl, 'there's more than one way to do it.' While that may be empowering, in a job area (Ops) which requires standards, organization and comprehensible documentation, it becomes very hard to enforce those standards.

What's the upshot?

The upshot is that if you're willing to invest the up front time and energy, not just in implementing Chef but in thinking carefully about your organization's processes and needs, then Chef can be a lifesaver. It will let you manage unthinkably large numbers of servers, both iron and virtual, with a minimum number of humans in the loop for day-to-day operations. It will let you make changes to your infrastructure with a minimum of fuss.

However, it will also suck up a lot of overhead resources to properly implement, and it may be overkill for your needs. Be sure it matches your situation.

Where do I get more information about Chef?

This is the internet. There's a wiki, of course.

Iron Noder 2010