“To err is human, to forgive divine.” Alexander Pope
Because we are human, we make mistakes in our work: we omit steps or lose our place. Generally, we can go back and correct them without much consequence. For important manual processes, though, there are steps that greatly improve the quality of manual execution. First and foremost is a standardized, documented process supplemented with a checklist. Clarity of procedure, and the supplemental effect of checklists on preparedness and teamwork, have shown improved results in a wide variety of fields (medical, aviation, manufacturing, etc.). To achieve even higher quality operational execution when the stakes are highest, the four eyes principle is a best practice.
In the context of High Quality Operations, four eyes is not a review or approval process (where two different people must review or approve a transaction). Instead, it is an operational procedure in which two capable operators work together to step through a procedure and ensure the right operational steps are taken in the proper order. The pre-flight checklist that pilots go through before commercial flights is an excellent example of standard procedures, checklists, and four eyes (the pilot and the co-pilot) working together. They jointly step through the inspection to ensure the plane is ready for flight. A mistake, of course, could be critical, with loss of life as a consequence. The outstanding safety record of commercial flying is well known and a direct result of these and other measures.
“To err is human, but to really foul things up you need a computer.” Paul R. Ehrlich
Similarly, for IT operations, top-quartile shops employ measures to improve execution quality. As a base set of practices, any change to the production environment requires a change ticket (with approval). All changes, even routine or administrative ones, have properly documented procedures to be executed. The start and completion of each change are confirmed with the command center. Back-out options are also included as part of the documented change.
But a human error, made at exactly the wrong time, can cause massive downstream impacts through a computer. Where the nature and criticality of a change make that possible, more quality measures must be in place.
For more complex changes, checklists are added to the procedure documentation to ensure critical tasks are completed, and completed in the right order. And for the most critical changes (e.g., core database updates or deletes, critical system recovery processes, etc.), two capable operators or engineers are engaged to ensure the procedures are followed correctly: one executes the commands while the other sits right beside them, looking over their shoulder to confirm that all is correct. This dramatically reduces the likelihood that a single operator, executing a complex set of tasks, makes a manual error and causes a massive outage, a loss of data, or another difficult-to-recover situation.
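To make the executor-and-observer procedure concrete, here is a minimal sketch in Python of a checklist that runs its steps in order and lets the second operator halt the run before any step executes. The names (Step, FourEyesChecklist, observer_confirms) are hypothetical and for illustration only; a real shop would tie something like this into its own change ticket and logging tools.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One checklist item: what it does and the action that performs it."""
    description: str
    action: Callable[[], None]


@dataclass
class FourEyesChecklist:
    """Hypothetical helper: one operator executes, a second verifies each step."""
    executor: str
    observer: str
    steps: List[Step]
    log: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Four eyes requires two different people.
        if self.executor == self.observer:
            raise ValueError("executor and observer must be different people")

    def run(self, observer_confirms: Callable[[Step], bool]) -> bool:
        """Run the steps in order; stop at the first step the observer rejects."""
        for number, step in enumerate(self.steps, start=1):
            if not observer_confirms(step):
                self.log.append(
                    f"{number}. HALTED by {self.observer}: {step.description}"
                )
                return False
            step.action()
            self.log.append(
                f"{number}. done by {self.executor}, "
                f"verified by {self.observer}: {step.description}"
            )
        return True
```

In interactive use, observer_confirms could simply prompt the second operator to type "yes" before each step. The point of the design is that no step runs until the observer has agreed to it, and the log records who executed and who verified.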
The four eyes principle should especially be used when data fixes, corrections, or deletions are being applied, because minor execution errors can have massive consequences: the wrong records deleted or improperly indexed, or data exposed or breached. Recovery from such errors can take inordinate effort. The change or fix should, of course, be properly designed, with inspections (preferably) or peer reviews. It should be tested as fully as possible, and it should be fully documented, with a checklist if it is complex. Finally, it should be executed with four eyes: two competent operators together ensuring the proper steps are carried out carefully.
Wherever possible, additional automated tools and aids should be in place so that common human errors are challenged rather than executed. Clear identification of the environment the commands are executing in or against is paramount. Automated warnings or cautions when executing broad commands are another good measure. These capabilities will help reduce typical errors, but custom fixes, major upgrades, and recovery procedures do not lend themselves to automation because they are normally single-occurrence and custom-crafted. Thus, use the same effective practices used in life-and-death situations (medical, aviation, military) to enhance human execution quality for your crucial technology operations procedures.
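As an illustration of the two automated aids mentioned above (clear environment identification and a caution on broad commands), here is a minimal sketch in Python. The function names and the rule that "broad" means a DELETE or UPDATE with no WHERE clause are assumptions made for this example, not a description of any particular tool.

```python
import re


def is_broad_statement(sql: str) -> bool:
    """Return True for a DELETE or UPDATE that has no WHERE clause.

    This is a simple text check, not a SQL parser: a WHERE that appears
    only inside a string literal or subquery would still count as present.
    """
    normalized = " ".join(sql.strip().rstrip(";").split()).upper()
    if not normalized.startswith(("DELETE ", "UPDATE ")):
        return False
    return re.search(r"\bWHERE\b", normalized) is None


def preflight(sql: str, environment: str, typed_confirmation: str = "") -> str:
    """Hypothetical pre-execution check; returns a message, runs nothing."""
    # Always show the operator which environment the command targets.
    banner = f"[{environment.upper()}]"
    if environment.lower() != "production" or not is_broad_statement(sql):
        return f"{banner} OK"
    # Broad statement against production: require the operator to retype
    # the environment name before the command is allowed through.
    if typed_confirmation == environment:
        return f"{banner} OK (broad statement confirmed)"
    return f"{banner} BLOCKED: broad statement, retype the environment name to confirm"
```

A check like this only challenges the operator; it does not replace the second pair of eyes for custom fixes and recovery procedures.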
Best, Jim Ditmore
For further reading: The Checklist Manifesto by Atul Gawande.