I recently had an argument with some of my colleagues about the difficulty and value of developing operational run books for an IT shop. By “recently” I mean every single day of my life. Since I must defend the utility of a run book this often, I wanted a link I could simply send along every time the thread comes back from the dead.
Defining Some Terms
Before I launch into a diatribe, there are enough definitions of “run book” out there that I need to explain the one to which I am referring. For me, a run book describes the common day-to-day operational management tasks for a complex hardware or software platform. These run books enable the administrator/operator to tackle the most common tasks using the most highly available tool in a consistent, repeatable and verifiable manner. I am not talking about migration run books or prescribed disaster recovery run books. Those documents serve very different purposes. I am only concerned with operations run books, whether for an application or a piece of infrastructure like a server, storage or the physical plant.
Operations run books enable peer review, training, rigorous process alignment and most importantly, standardization of execution. If you use a good run book to provision, say, storage resources and get called away at step 7 of 10, anyone on the team can pick up precisely where you left off without so much as a word. This is the solution to the “if you get hit by a bus tomorrow” problem. It works. It works for a lot of reasons but one factor is that a good guide only documents the procedures and activities IT needs on a regular basis. Those regular activities are then customized to the specific client need. You do not rewrite the product manuals.
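To make the hand-off idea concrete, here is a minimal sketch of a run book procedure as an ordered checklist that records who completed each step. Everything in it is an assumption made for illustration: the `Step` and `Procedure` names, the operator names and the placeholder instructions are hypothetical, not taken from any real run book or tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    number: int
    instruction: str
    done_by: Optional[str] = None  # operator who completed the step, if any


@dataclass
class Procedure:
    title: str
    steps: List[Step] = field(default_factory=list)

    def next_step(self) -> Optional[Step]:
        """Return the first step nobody has completed, or None if finished."""
        for step in self.steps:
            if step.done_by is None:
                return step
        return None

    def complete_next(self, operator: str) -> Step:
        """Mark the next open step as done and record who did it."""
        step = self.next_step()
        if step is None:
            raise ValueError("procedure is already complete")
        step.done_by = operator
        return step


# Hypothetical 10-step storage provisioning procedure.
procedure = Procedure(
    title="Provision storage",
    steps=[Step(n, f"Placeholder instruction {n}") for n in range(1, 11)],
)

# The first operator finishes steps 1-6 and is called away at step 7.
for _ in range(6):
    procedure.complete_next("alice")

# A second operator picks up exactly where the first left off.
resumed = procedure.complete_next("bob")
print(resumed.number)  # 7
```

The design choice worth noticing is that the progress lives in the document, not in anyone's head, which is why the second operator needs no verbal hand-off.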
Now that we understand one another…
Is that a reasonable enough goal? I think it is unassailable, yet I get this comment all the time from teams who ultimately have to come around:
Why should we re-invent the wheel and devote resource time to creating something that already exists? Why not spend the time walking them through (hands on) each activity that is identified in the Administrator’s Guide?
Have people ever actually READ an administrator’s guide? The point of a run book is to translate the admin guide into actionable procedures and save them for posterity. The 5% of the admin guide that is useful day-to-day needs operational context. This is hard. This means knowing what is in the guides, what clients need to do every day (and what to do once a month, once a year or just once), how all this fits into their IT process, how to abstract a set of complex steps down to something repeatable and how to write it all in plain language that everybody can follow at 3am after 2 all-nighters and a chain of Sev-1s. It also means knowing what people do in IT operations and showing it.
It is hard to distill 1,000 pages of administration into 50-75 pages of critical mass that links a client’s provisioning, incident, change, config, release and service management process frameworks. It is also what our clients want. The lack of a run book is one reason why some of the hottest new technologies are sold and fail to live up to their promise or sit idle waiting for enough expertise to come along and manage the new beast. Clients have the admin guides. Admin guides assume you already know what you are doing; that is why they carry a 20 page index. Admin guides are critically important but they are a reference, not the gospel on the “right” way to do things in your specific shop.
A run book should be written assuming you just got handed a service ticket for some brand new infrastructure, today is your first day on the operator job, the building is on fire and your life depends on not screwing up. Clients make a good run book their official cubicle wallpaper in IT precisely because their jobs DO depend on not screwing up. On top of that, a good run book for your company’s infrastructure probably has little in common with one written for the same infrastructure at another company. A good run book overcomes habits that were good in another context (or with another employer) but are hideously bad in yours.
Again: this is hard. If it were an easy task our customers would simply do it themselves. Instead, they have engineers performing system administration and Level-1 support when they should be designing new services and capabilities the business wants. Too many IT shops willingly pay tens of thousands of dollars in unnecessary costs acquiring and holding onto talented people who then become overpaid infrastructure babysitters instead of rock stars training the junior staff. Run books enable your premium staff to do the job for which they rightly receive premium wages. Run books save you money and sanity.
I smell hyperbole. Really?
Yes. I managed world-class infrastructures with these documents when I rode the pager. We did it with a skeleton crew and managed to otherworldly performance and availability targets. You have not lived until you turn the keys to a million-dollar storage array over to a kid 2 years out of college and tell him to migrate the 50TB R/3 core using nothing but the run books (ok, we did not actually go THAT far, but close). In moments of egotistical weakness I like to think the only run books better than those run books have Semper Fi on the front cover. The Marines call them SOPs. If SOPs can guide 18-year-old Marines to repair tanks, surely they can be used to codify a few provisioning tasks. Ooh-Rah. (Note: I probably could not repair a tank unless that tank were made of Lego, even with an SOP.)
Not a lot of consultancies do this kind of work. Even fewer do it well because it takes a unique set of technical, operational and writing skills to vet and assemble the right content in an intuitive form. Once a run book is done, updating it is simple. Run books are living documents after all; users just need to see an example. I received a LOT of resistance to introducing this capability to the practice years ago. When I leave to go back to managing IT some day I will probably meet the same resistance, but this time from engineers and senior administrators who feel like these documents devalue their talents. The opposite is true. It usually takes just one or two service requests the engineer did not have to handle (because she is “the only one who knows how to configure that one thing”) before she is swearing by the run book.
But there is a catch, right?
A client called me last week and shared their concerns about one of our run books. I was ready to parachute a new team in but the client was quick to defend my people and indicated the problem was the result of their not having an internal delivery process to help guide the document construction. An outsourcer was in charge of operations and they did not want the provider to know the jig was up. We broke our own methodology at their request. I am still unhappy with the situation. That is the catch: do it wrong and you have done nothing more than create a dust collector for the bookshelf.
That is why we must learn the client’s operations processes, document the repeated activities in that context, show them the best practice procedures to complete the activities and test, test, test. If the user has to be an expert just to pick up the run book, we missed. That is the benchmark for creating a bulletproof guidebook.
Still, there will be naysayers. Adversarial product vendors will ask me, for example, “where’s the procedure for setting up SAN booting off my array/HBA/blade server” when the client does not boot from the SAN and has no intention of doing so. Some people will insist this is a low-value tool, and IT staff all need some coaxing to see the light, but the results are worth the trouble. When the automated tools fail, the local “guru” is on vacation and the building is collapsing around you, the run books will still be there.
The run book does not know that you are new to the job. It does not judge your two-fingered typing or how slowly you read. It does not think less of you because you are a manager (at least it will not say it to your face). A good run book just works. It enables you to play your role and sometimes two above yours. It is a force multiplier when staffing is cut to the bone so deep you can see marrow.
Just like the Marines. Thank you, Devil Dogs, for the inspiration. This idea has saved IT’s bacon more times than anyone will ever know.