Risk Management:

5 Step Process:

Identify hazard
Assess risk
Make risk decisions
Implement controls
Supervise & Monitor

3 Step Process:

Ask
Assess
Act

2 Step Process:

Crosscheck
Control

The risk management process is a continuous cycle; it has no beginning and no end. When managing change, it is important to evaluate a process and ensure that the most effective training, equipment and procedures are in place within a well designed organizational structure. By monitoring the operational environment on a regular basis, it becomes easier to spot threats to the operation. Another benefit is that it becomes easier to plan for future operations and to spot opportunities to optimize the performance of humans and machines.

Recommended Reading:



Reason, James. 1990. Human Error. Cambridge University Press, New York.
Reason, James. 1997. Managing the Risks of Organizational Accidents. Ashgate, Burlington, VT.



Practical Risk Management
by Kent B. Lewis


“We realize that as tragic as past accidents are, the lessons learned from those disasters must not be forgotten. In order to avert future accidents, all causes of a mishap must be vigorously pursued so that preventative measures can be found and implemented” (Walters and Sumwalt 2000, xxiv).
-Aircraft Accident Analysis: Final Reports

Introduction
Practical risk management is about realizing that tragic mishaps also lie in our future unless the multiple hazards that combine to create risk are identified and controlled. The risk management process is the “engine” that drives a generative safety management system (FAA 2008). As part of this system, hazards are proactively identified by systems experts prior to mishaps in order to create information about risk. “Information is best understood within the context where the individuals make choices, i.e. utilize the information” (Leng 2009). Armed with this unique perspective, better strategic risk decisions can be made, reducing the need for risky and reactive tactical risk decisions. Through valid risk analysis and assessment, preventative measures are crafted and implemented that eliminate or reduce the severity and frequency of mishaps.

Much has been written about risk management and there are many models from which to choose when embarking upon a quest to place risk at an acceptable level within an organization. Several models will be explored here with examples of each model to demonstrate real world applications, including models from the Department of the Navy, the Federal Aviation Administration (FAA) and International Civil Aviation Organization (ICAO). These models are part of a systematic process that has integrated key elements. These elements include the definition of a system that is to be studied, identification of a problem or hazards, analysis of risk derived from the data, a quantitative and/or qualitative assessment of risk, decisions on risk controls and monitoring of the system to ensure desired results.

The objective of this chapter is to examine some real world examples of risk management, but it will be beneficial to take a moment to review the foundational concept of risk management. The risk management process may be as simple as a two-step process, such as the operational process used in instrument flying that involves “control” of the aircraft and a “crosscheck” of flight parameters. “Control and crosscheck” is a very simple, descriptive model, but there are many complex and dynamic factors that must be managed during this process in order to ensure that control of the aircraft and the desired flight path are maintained at all times. This is one elegant example of risk management and risk assurance working closely in concert. Another model suitable for time critical risk management is the FAA Industry Training Standard (FITS) 3-step model “Perceive, Process, Perform”. A common theme found in all risk management models is the use of a looping process, essential to continuous learning and system improvement. One of the most famous looping models is the “OODA Loop,” a four-step conflict resolution process that consists of Observation, Orientation, Decision and Action. This decision making model was developed by military strategist and United States Air Force Colonel John Boyd, and has been widely applied as an information management resource to organizational operations and continuous improvement processes (Schechtman 1996, 32). A model that is widely used in deliberate operational risk management is a five-step process (FAA Flight Standards SMS Program Office 2008). The five steps are:
1. System Analysis (Design)
2. Hazard Identification
3. Risk Analysis
4. Risk Assessment
5. Risk Control
This model contains the key steps that most risk management models contain and is scalable for use from cradle to grave on any large system or small project.

For consideration of our case studies in this chapter, we will use parts of a five-step model that has been in use by the U.S. Department of the Navy for over 15 years. Its genesis resides in lessons learned from U.S. Army Operational Risk Management and the Systems Safety Program requirements of Military Standard 882. “This military standard addresses a wide range of safety management and safety engineering issues” divided into the groups of program management and control, design and integration, design evaluation, compliance and verification (Naval Safety Center 2010). This standard was chosen for use in the Naval Aviation Safety Program because human factors experts recognized the success achieved when a systematic approach was taken to reducing material factor contributions to mishaps. At the time the influence of material factors was decreasing because of systematic improvement in technologies, and human factors were becoming increasingly prevalent in system failures. There was a need to enhance the human factors aspects of accident investigations, and it was time to adopt a scientific model for widespread use across the full spectrum of the system. Not only did a new framework need to be developed for mishap investigations, there were also benefits to be realized from applying this framework as a proactive risk management scheme, as “such a framework would also serve as a foundation for the development and tracking of intervention strategies so that they can be modified or reinforced to improve safety” (Shappell and Wiegmann 2003, 19). A new focus emerged in the human factors aspects of system performance that complemented a strong engineering focus, and practitioners began work in earnest to apply this model to improve operational safety. The five steps of the Naval Safety Center risk management model are:
1. Identify hazards;
2. Assess the hazards;
3. Make risk decisions;
4. Implement controls; and
5. Supervise and watch for change.
(Naval Safety Center 2009).
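
To make the looping character of these models concrete, the sketch below (Python, with entirely hypothetical function names and data) walks one pass of the five-step cycle; each function is a placeholder for the activities described in this chapter, not an implementation of any Navy or FAA tool.

```python
# A minimal sketch of the five-step cycle as a loop. The step functions are
# hypothetical placeholders for the activities described in this chapter.

def identify_hazards(operation):
    """Step 1: gather hazard reports, audit findings and observations."""
    return operation.get("reported_hazards", [])

def assess_hazards(hazards):
    """Step 2: order hazards by risk assessment code (1 = worst, 5 = least)."""
    return sorted(hazards, key=lambda h: h["rac"])

def make_risk_decisions(assessed):
    """Step 3: decide, at the appropriate level, which hazards to control first."""
    return [h for h in assessed if h["rac"] <= 3]

def implement_controls(selected):
    """Step 4: put the chosen controls in place (design, barriers, training)."""
    for hazard in selected:
        hazard["controlled"] = True

def supervise(operation):
    """Step 5: monitor the operation, watch for change, and feed the next cycle."""
    return operation

def risk_management_cycle(operation):
    # The process has no real beginning or end; each pass feeds the next one.
    hazards = identify_hazards(operation)
    selected = make_risk_decisions(assess_hazards(hazards))
    implement_controls(selected)
    return supervise(operation)

# Example pass over a toy operation with two reported hazards.
operation = {"reported_hazards": [{"name": "no VHF radio", "rac": 2},
                                  {"name": "faded taxiway markings", "rac": 4}]}
risk_management_cycle(operation)
```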

Notice any patterns emerging yet? These models are basically adaptations of the scientific method: a hypothesis is formed (a problem identified), data are gathered and evaluated, and solutions are implemented and then evaluated. The strength of this approach is that it capitalizes on common sense within a common strategy, and it has widespread system applications (Budd 2003, 35).
Keep in mind that another common theme in all of the models is that the goal is not necessarily to eliminate risk, but to manage risk so that the operation can be completed with the optimal balance between production and protection. Several key considerations must be factored into risk management, such as making risk acceptance decisions at the appropriate level, knowing when it is beneficial to use strategic or tactical levels of risk management, and prioritizing resources to participate in the process. One additional key in choosing a model for an organization is to choose a process that is simple and can be understood by those who will use it, to ensure that it will actually be used. There is no quicker way to increase financial risk to an organization than to implement an unwieldy risk management process as part of a safety management system. Risk management must be made as simple as possible, because if it is hard, people will avoid doing it.

Before considering some practical examples, remember that risk management is an integral part of a safety management system (SMS), a system that consists of written procedures and plans coupled with policy, safety assurance and promotion of safety programs. The goal of SMS is to prevent loss of life and property while conducting daily operations, and this is accomplished through the detection and mitigation of hazards. Risk management forms the foundation for an effective SMS, regardless of the size, mission or resources of the organization, team or individual. Using the five-step process as a guide, we’ll now review some examples from the past and present and then look to the future.

Step 1. Identify hazards
Case Study 1: For this example, picture yourself in the pilot’s seat. You are the Instructor Pilot in a T-34C, flying in South Texas in a tandem seat turboprop trainer with a primary flight student under your tutelage. Your mission today is to introduce the student to touch and go landings and your destination is Aransas County Airport, a civilian airfield in Rockport, Texas. It will be a challenging day because the airport is also utilized by other military and civilian aircraft, and visual lookout for other traffic is a high priority. Given the tandem seating arrangement of the T-34, while flying from the rear seat you will have a great view of the back of your student’s head and limited visibility to either side of the aircraft. It is your student’s sixth flight ever in an aircraft, so his visual lookout skills are not very developed. Compounding the challenge is the fact that your aircraft radio is designed to work only on military frequencies, so you will not hear radio transmissions from the civilian aircraft operating to and from the uncontrolled airport. But it is a nice day, so off you go for a fun day of flying.
The transit to Aransas County goes smoothly, and you are on a base turn for landing on runway 36 when a helicopter unexpectedly appears below your right wing on a collision course. You immediately add power and climb back to altitude to avoid the collision. What happened, and how did the risk controls in place fail to protect operators at a critical time? How did we almost lose three pilots, four passengers and two aircraft on a clear, sunny day? And given this scenario, do we need to wait until we have another near midair collision, or should we take steps to ensure an appropriate level of safety within the system? What is our next step?

Our next step is to identify hazards. We have learned from definition that hazards are conditions with the potential to cause personal injury, death or property damage. These conditions can lie dormant until acted upon by external influences and changes to the environment. Hazards may exist for many reasons: as a result of poor design, improper or unprofessional work or operational practices, inadequate training or preparation for a task or mission, inadequate instructions or publications, or because the environment is demanding and unforgiving (OPNAVINST 3750).

Our case involves Naval Aviation flight training and the potential for midair collisions. In the early days, flight training bases were situated in isolated areas and there was not much interaction between civilian and military aircraft. The Naval Air Station in Corpus Christi, Texas is home to several flight training squadrons, and during the course of training crews operated in the local area and also flew in and out of civilian airfields. Initially the risk of midair collision associated with this training was assessed as low, because while the consequence of a collision was unacceptable, the frequency of exposure was very low. Through the years, though, both military and civilian air traffic increased, and reports of near midair collisions increased. This was obviously a problem, so the first step the Training Command took was to station a Runway Duty Officer at the airfield who would relay information on civilian aircraft position to squadron aircraft using a military and civilian radio. There was also a stated policy that no more than four Training Command aircraft would use the airport if civilian aircraft were present. Problem solved, correct? Possibly not; in fact the addition of an extra person relaying time-delayed communications only added to the instructor’s already busy workload and degraded training efficiency. A long-term risk control was being developed to mitigate collision risks, but this collision warning system would have limited functionality in the dynamic environment of an airport traffic pattern.

The first element of this risk management model is to identify hazards within the chosen system, and not only to identify all hazards but to take the immediate actions necessary to maintain the appropriate level of safety within the system. In this case you took immediate action as an Instructor to avoid the collision, and upon return to the Air Station it was time to report the near midair collision to the safety department, in accordance with both military and civilian transportation safety regulations. The military has a hazard reporting system that is similar in functionality to the National Aeronautics and Space Administration Aviation Safety Reporting System (NASA ASRS), where demographic information and a description of the event are provided for further analysis and assessment. Given the information we had so far, our next step was to identify hazards: what factors were present here that could cause injury or damage? In this case, the identified hazard was the lack of a radio that operated on civilian frequencies. Without this equipment, pilots were left to use basic “see and avoid” techniques, a procedural intervention at the lowest level that is subject to variations in human performance. So we had a hazard with potentially severe consequences, and through hazard reporting it was discovered that several similar events had been reported at other joint use airfields throughout the training command. Utilizing this information and a risk assessment matrix, the hazard was assessed as a severe hazard that required system interventions. With this risk management model, when a severe hazard at level one or two was discovered, a report had to be generated to notify the appropriate level risk managers within 24 hours. This report also included recommendations for corrective actions that were generated with input from the squadron’s Instructor Pilots; in this case we recommended continuing education on the midair collision potential as a short term control and acquisition of a dual band UHF/VHF radio as a long term system solution.
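
To illustrate the reporting and triage flow just described (demographic information, an event narrative, an assessed level, and 24-hour notification for level one or two hazards), here is a hedged Python sketch; the record fields and triage wording are illustrative only and do not reproduce any actual Navy or NASA form.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class HazardReport:
    """Hypothetical hazard report record: demographics plus an event narrative."""
    reporter_role: str            # e.g. "Instructor Pilot"
    location: str                 # e.g. "Aransas County Airport, Rockport, TX"
    narrative: str                # description of the event
    rac: int                      # risk assessment code from the matrix (1-5)
    recommendations: list = field(default_factory=list)
    submitted: datetime = field(default_factory=datetime.utcnow)

def triage(report: HazardReport) -> str:
    """Severe hazards (RAC 1 or 2) require notifying risk managers within 24 hours."""
    if report.rac <= 2:
        return "Notify appropriate-level risk managers within 24 hours"
    return "Route through routine hazard tracking and periodic review"

# Example: the near midair collision report from Case Study 1.
nmac = HazardReport(
    reporter_role="Instructor Pilot",
    location="Aransas County Airport, Rockport, TX",
    narrative="Near midair collision with civilian helicopter on base turn",
    rac=2,
    recommendations=["Midair collision avoidance refresher training",
                     "Acquire dual-band UHF/VHF radios"],
)
print(triage(nmac))
```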




RISK ASSESSMENT MATRIX

                                        PROBABILITY
                                 A         B          C      D
SEVERITY                         Likely    Probable   May    Unlikely
I    Death, Loss of Asset        1         1          2      3
II   Severe Injury, Damage       1         2          3      4
III  Minor Injury, Damage        2         3          4      5
IV   Minimal Threat              3         4          5      5

Probability
A – Likely to occur immediately or within a short period of time.
B – Probably will occur in time.
C – May occur in time.
D – Unlikely to occur.

Severity
I – May cause death, loss of facility/asset, mission failure.
II – May cause severe injury, illness, property damage, severe mission degradation.
III – May cause minor injury, illness, property damage, minor mission degradation.
IV – Minimal threat, no mission degradation.

Risks assessed as a 1 or 2 are considered severe hazards that require immediate remedial actions.
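
For readers who want to automate the lookup, the sketch below encodes the matrix above as a small Python table; the function names are illustrative, and the example categories are one plausible reading of the Case Study 1 assessment rather than an official finding.

```python
# Risk assessment matrix from above, keyed by (severity, probability).
# Severity I-IV and probability A-D follow the definitions listed in the text.
RISK_MATRIX = {
    "I":   {"A": 1, "B": 1, "C": 2, "D": 3},
    "II":  {"A": 1, "B": 2, "C": 3, "D": 4},
    "III": {"A": 2, "B": 3, "C": 4, "D": 5},
    "IV":  {"A": 3, "B": 4, "C": 5, "D": 5},
}

def risk_assessment_code(severity: str, probability: str) -> int:
    """Return the risk assessment code (1 = most severe, 5 = least)."""
    return RISK_MATRIX[severity][probability]

def requires_immediate_action(severity: str, probability: str) -> bool:
    """Codes 1 and 2 are severe hazards requiring immediate remedial action."""
    return risk_assessment_code(severity, probability) <= 2

# Illustrative reading of Case Study 1: severity I (possible loss of life and
# aircraft), probability C (may occur in time) -> RAC 2, a severe hazard.
assert risk_assessment_code("I", "C") == 2
assert requires_immediate_action("I", "C")
```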

So what was the resolution? Based on inputs from squadron pilots through the hazard reporting system and advocacy efforts by NAVAIR and the Navy’s Office of Safety and Survivability, money was appropriated to acquire dual band radios for the entire T-34C fleet. The total cost of the retrofit was under 2 million dollars, because the trainer was not a combat aircraft and a commercial off-the-shelf system was immediately available. The upgraded radio system not only offers an enhanced level of safety, it also increases the quality of training by exposing military flight students to civilian air traffic control communications throughout the entire national airspace system, an increase in both training effectiveness and system safety.

The most important step in risk management is identification of hazards. It is impossible to be proactive and implement risk controls if hazards remain unseen until discovered in a mishap investigation. There are many ways to identify hazards, and the most effective risk management systems are those where hazards are identified by reporters who are most familiar with the system risks. “The information we are seeking is stored in human heads and in books and data banks. Moreover, the information in books is also indexed in human heads, so that usually the most expeditious way to find the right book is to ask a human who is an expert on the subject of interest” (Simon 1997, 243). Hazards can be identified by individuals or by teams, such as safety councils or audit teams as part of the assurance process, and part of the challenge here is to motivate people to identify and report hazards. There must be an informed, learning, reporting culture supported by a foundational safety policy and an easy-to-use reporting system. In small organizations this may be a written report, and in large organizations there will most likely be web based reporting portals. Throughout the air line industry there are additional conduits of information through programs such as the Aviation Safety Action Program (ASAP), Air Traffic Safety Action Program (ATSAP), Flight Operations Quality Assurance (FOQA) Program and Advanced Qualification Program (AQP) that provide quantitative and qualitative data to both the NASA ASRS and FAA Accident Analysis and Prevention (AAP) programs. This information is still valuable in a traditional reactive sense, and is also becoming increasingly valuable as a knowledge development tool in proactive and generative safety systems.
These systems essentially conduct “pre-mishap” interviews, and predict what “could happen” versus waiting to see what will happen. Text mining programs are being developed to look for patterns in operational data, signals in the noise. Has it happened before, i.e. frequency? Could it happen again? What are the potential consequences, i.e. severity? And what can we do within the system to prevent it from happening? Environmental scanning is a key element in any business plan, and what better area to invest time and talent than in areas that discover threats to the system before mishaps occur. To improve our reporting, it is also necessary to train people to think like investigators, so that when they come across hazards they can identify them and begin the risk management process. It is also necessary to have an individual and organizational culture that embraces generative reporting and learning. Trust must be built and shared between groups with different goals, and this takes a staggered approach. The incentives to encourage reporting are that individuals and organizations will have a good idea of what is going on, people will feel empowered to concentrate on doing a quality job and resources can be prioritized to fix what needs to be fixed (Dekker 2007, 26).
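
As a toy illustration of this kind of pattern search, the sketch below counts hazard-theme keywords across a handful of invented report narratives; real text mining against the ASRS or similar databases is far more sophisticated, and neither the reports nor the keyword list here come from any actual data set.

```python
from collections import Counter
import re

# Invented example narratives standing in for de-identified hazard reports.
reports = [
    "Near midair collision on base turn, no radio contact with civilian traffic",
    "Runway excursion after landing on contaminated runway, braking action poor",
    "Checklist interrupted during taxi, takeoff configuration warning on roll",
    "Braking action reported nil, aircraft stopped with little runway remaining",
]

# Keywords chosen to flag recurring hazard themes ("signals in the noise").
keywords = ["midair", "runway excursion", "braking action", "checklist", "contaminated"]

def keyword_frequencies(narratives, terms):
    """Count how often each hazard theme appears across the report set."""
    counts = Counter()
    for text in narratives:
        lowered = text.lower()
        for term in terms:
            counts[term] += len(re.findall(re.escape(term), lowered))
    return counts

for term, n in keyword_frequencies(reports, keywords).most_common():
    print(f"{term}: {n} report mention(s)")
```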

Future information systems will emerge from simplistic functionality and capture the advantages of collaborative Web 2.0 and 3.0 environments. In these environments people can share and shape information in the metaverse, which is needed for the advancement of global safety management systems. Improvement of safety at a systems level is the next imperative; organizations and regulators must reach across boundaries and share standards in an open, learning environment. Valid, timely hazard reporting is necessary to support the next element in risk management, which is risk assessment.

Hazard Identification Snapshots
There are many web based information services that collect, store and disseminate safety information. Some excellent examples of environmental scanning on the web are the Flight Safety Information newsletter at http://www.fsinfo.org, signing up for Google Alerts keywords at http://www.google.com/alerts, aviation safety wikispaces such as http://www.signalcharlie.net, the NASA ASRS online databases and reporting system at http://asrs.arc.nasa.gov, and SKYbrary, a web reference for aviation safety knowledge, at http://www.skybrary.aero.

Factoids
-In 2009 NASA’s ASRS received over 50,000 reports, the most in its 35-year history. That is a safety report roughly every 10 minutes.
-A major air line receives close to 10 reports a day.
-A flight museum may receive 1 to 2 reports per year, and has a web-based SMS (Vintage Flying Museum 2010).

Step 2. Assess risks
Case Study 2: On April 12, 2007, about 0043 eastern daylight time, a Bombardier/Canadair Regional Jet (CRJ) CL600-2B19, N8905F, operated as Pinnacle Airlines flight 4712, ran off the departure end of runway 28 after landing at Cherry Capital Airport (TVC), Traverse City, Michigan. There were no injuries among the 49 passengers (including 3 lap-held infants) and 3 crewmembers, and the aircraft was substantially damaged. Weather was reported as snowing. The airplane was being operated under the provisions of 14 Code of Federal Regulations (CFR) Part 121 and had departed from Minneapolis-St. Paul International (Wold-Chamberlain) Airport (MSP), Minneapolis, Minnesota, about 2153 central daylight time (CDT). Instrument meteorological conditions prevailed at the time of the accident flight, which operated on an instrument flight rules (IFR) flight plan.
The National Transportation Safety Board determines the probable cause(s) of this accident as follows: The pilots' decision to land at Cherry Capital Airport (TVC), Traverse City, Michigan, without performing a landing distance assessment, which was required by company policy because of runway contamination initially reported by TVC ground operations personnel and continuing reports of deteriorating weather and runway conditions during the approach. This poor decision-making likely reflected the effects of fatigue produced by a long, demanding duty day and, for the captain, the duties associated with check airman functions. Contributing to the accident were 1) the Federal Aviation Administration pilot flight and duty time regulations that permitted the pilots' long, demanding duty day and 2) the TVC operations supervisor's use of ambiguous and unspecific radio phraseology in providing runway braking information (NTSB AAR-08-02, 2008).

There were many situational hazards present during this operation, and the confluence of these factors resulted in a mishap. Fortunately no one was injured and the airplane returned to revenue service after the nose landing gear and pressure bulkhead were repaired. When all was said and done the passengers were not even significantly delayed, and some did not even realize there was a mishap until they were asked to deplane and ride a bus to the terminal. This mishap is a textbook example of why hazards should be identified, risks analyzed and assessed, and control decisions made before an operation is conducted. Just by reading the terse description of the mishap from the NTSB Probable Cause statement we begin to see the precursors, the latent conditions that existed before the mishap crew reported to the airport for duty and, in the case of pilot flight time and duty time regulations, existed even before the Captain or First Officer were born.

Multiple factors were identified in this mishap, beginning with crew experience. The Captain was a Line Check Airman who was giving Initial Operating Experience training to a new First Officer, which happens every day in air carrier operations, but as we will discover, perhaps this was not the best time or place to schedule or conduct this training. The crew was qualified to operate the aircraft and there were no mechanical issues with the airplane, so for initial risk assessment purposes we have the elevated potential for incidents with a new First Officer balanced by the presence of an experienced Line Check Airman. The crew had a relatively normal first day and reported for duty on the second day with a challenging full schedule ahead. The crew flew 4 flights and was in position in Minneapolis for the fifth scheduled flight to Traverse City. At this point in their day the crew had been on duty over 12 hours and had flown over 6 hours, and by the time they arrived in TVC they would have been on duty for close to 16 hours, with flight time exceeding 8 hours. Add to this that the flight would be conducted during a known window of circadian low, with estimated time of arrival at the airport set for after midnight.

We’ll pause again here to see if there are yet any identified hazards that may be of interest to supervisors, flight operations planners, meteorologists, dispatchers, airport operations personnel and pilots. As it turns out there were, and while the Captain communicated frequently with the Dispatcher on rapidly changing weather conditions, the airport personnel in Traverse City continued efforts to keep the runways plowed and the airport open for air traffic. Initial weather and runway condition reports caused the flight to be put on hold for several hours, as reduced visibility required the use of a specific runway with an Instrument Landing System, but that runway had a tailwind and the braking action was also being reported between Good and Fair. Some system controls were in place: operational specifications prohibited use of that runway with tailwinds and reduced braking action, and the reduced visibility precluded the use of a visual approach to the airport. Many of these factors are considered during time critical risk management and risk controls are developed, so these were not extraordinarily unique conditions when considered in isolation. For this event, though, there were many hazards beginning to coagulate and the system was not robust enough to predict the increasing level of risk.
To recap, we have a low time First Officer on day two of line flying, with the crew approaching both the 16 hour duty day and 8 hour flight time limit, planning a flight into an airport that had battled winter weather conditions throughout the evening and into the early hours of the morning. That is the nature of air transport flight operations, and these risks are managed successfully hundreds of times every day throughout our global system. But we are not quite through with the risk assessment for TVC, as we also need to consider the fact that the runway is relatively short for air carrier operations, 6500 feet, with a Medium Intensity Approach Light System with runway alignment indicator lights (MALSR) and a 4 light Precision Approach Path Indicator (PAPI). And just to round out the evening’s festivities, the air traffic control tower was closed and braking action reports would have to be relayed to the inbound aircraft by airport operations personnel. All of these hazards were known before the mishap; there were no surprises here, just people who were used to getting things done and moving people from point A to point B in a safe manner. No one showed up to work that day and decided to see how far off the end of a runway a regional jet could actually travel, and the airport personnel certainly did not have emergency response and extrication of an RJ from the mud high on their list of desired activities. As a matter of fact, the airport operations personnel in TVC had been recognized by the Air Line Pilots Association in previous years for their proactive stance on safety enhancements to the airport ground environment.
With all of this information in hand, a time critical risk assessment could be conducted, and most likely it would score in a category that required actions be taken to reduce the level of risk. There was a known risk for this type of mishap; in fact hardly a winter season goes by in which several similar mishaps do not occur. An important element of any business plan is to scan the environment, and by doing so we would know that the probability of such an event given these conditions was that it would probably occur or may occur in time. Next we would consider the severity and arrive at the conclusion that a runway excursion caused by winter conditions would have the potential of causing severe injury and/or severe mission degradation. For this assessment we are using quantitative data to establish probability and qualitative judgment to place severity. Utilizing the Naval Safety Center risk assessment matrix shown earlier, our risk level would be placed at a high or moderate level and we would need to take actions to reduce this level of risk.
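
Restated with the matrix values shown in Step 1, the assessment described above might look like the short sketch below; the probability and severity categories are simply the qualitative judgments from this paragraph expressed in code, not an NTSB or Navy product.

```python
# Restating the TVC pre-flight assessment with the matrix values shown earlier.
# Probability B ("probably will occur in time") or C ("may occur in time"),
# severity II ("severe injury ... severe mission degradation").
MATRIX_ROW_II = {"A": 1, "B": 2, "C": 3, "D": 4}

for probability in ("B", "C"):
    rac = MATRIX_ROW_II[probability]
    # Level names follow the wording of the paragraph above (high or moderate).
    level = "high" if rac <= 2 else "moderate"
    print(f"Severity II, probability {probability}: RAC {rac} ({level} risk)")
```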

Using the TEAM approach found in personal risk management checklists, options to control risk would include Transferring, Eliminating, Accepting or Mitigating the risk. We know now that something in the system needed to change, so we’ll look at the things that we can control. We can’t change the weather, but we can wait until it improves. We already predict the weather, so why not predict the risk factors as part of a predictive risk assessment? We could cancel the flight, or we could call in a properly rested reserve crew. We can’t add more concrete to the runway, but we can brief the threats of a short, contaminated runway and strive to fly the textbook approach. And upon arrival at the airport we can get the most current runway condition and braking action information available, from which landing or diversion decisions can be made. Some of these controls seem reasonable when we consider the performance of a crew that is not fatigued, so we need to remember to weight our decisions with the anticipated performance level of our crew in mind.
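
The TEAM options in this paragraph can be laid out explicitly, as in the notional Python listing below; the candidate actions are the ones just discussed, collected here for illustration rather than as an approved procedure.

```python
# TEAM risk control options applied to the TVC scenario described above.
# The candidate actions paraphrase the text and are listed for illustration only.
team_options = {
    "Transfer":  ["Call in a properly rested reserve crew"],
    "Eliminate": ["Cancel the flight", "Hold until the weather improves"],
    "Accept":    ["Proceed only if residual risk is accepted at the appropriate level"],
    "Mitigate":  ["Brief the short, contaminated runway threats",
                  "Fly the textbook approach",
                  "Obtain current runway condition and braking action reports"],
}

for option, actions in team_options.items():
    print(option)
    for action in actions:
        print(f"  - {action}")
```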

Unfortunately on this April morning, the mitigations chosen were not sufficient and the challenges associated with operating this flight were not met. As mentioned in the probable cause statement, the decision to land without valid runway condition information resulted in the aircraft departing the end of the runway at a low speed and striking a berm, which damaged the airplane. Now from the comfort of our living room couch and in the clear 20/20 vision of hindsight, many of us would say we would have conducted the flight differently or even cancelled the flight, but that is not the reality of the events from that evening or the reality in real world operations. This is a case of where reality and rationality were in conflict, production and protection were at odds. Can this be written off as just another case of “pilot error”, or were there regulatory, organizational and supervisory factors in play here as well?

“…is it more important to ask whether isolating human action as the cause of an accident is 1) sufficient, given that causation is often multifactorial, non-linear and complex (Woods and Cook, 1999); and 2) meaningful for the designer or organizational decision maker interested in ensuring safety for the future” (Holden 2009, 36).

This mishap may have been prevented had there been flight time and duty time regulations in place that reflected the 24 hour, jet age nature of our transportation system and knowledge of fatigued human performance. Had there been better use of proper radio terminology or a fully operational and staffed air traffic control tower at the airport that night, the Captain might still be employed at this airline. And had there been a longer runway with the recommended runway safety area free of obstacles, specifically designed for use by air carrier turbojet aircraft, money could have been saved on repairs to the jet and loss of utilization for revenue service. The challenge we face with safety management systems is to capture these lessons learned and implement system solutions at all levels that will prevent future mishaps of this nature.

Deliberate and time critical risk assessment is not just part of a reactive process, but has been incorporated by many individuals and organizations as part of a systematic, proactive approach to safely conduct operations. There are many examples of risk assessment matrices and personal minimum checklists available for use; these can be found on the web and are valuable resources that can be used in the early stages of risk management to prevent mishaps. This is a good time to remember that risk control recommendations should address short term, mid term and long term solutions, and investigators should do so without consideration of cost. That is not because cost is not an issue, but because cost is an issue that should be decided at the appropriate level. Many times the recommendations offer long term cost savings, because the hidden costs of a mishap can be three to five times the visible costs. There may be damage to the environment, loss of trust and subsequently revenue in a customer base, reduction in revenue from loss of assets, civil and criminal legal fees and awards, and potential fines from a regulator.

Investigative teams can be exercised by participating in the risk management process; there is no need to wait for loss of life or damage to property. Many organizations are creating hybrid programs that combine accident analysis, investigation and prevention teams; the FAA is one example and the Air Line Pilots Association is another. In some organizations pre-formed mishap investigation teams do not come together until after a mishap; in generative programs these teams will work together with flight operations personnel and use risk management principles to investigate hazards and make risk decisions to reduce risks throughout the system.

Risk Assessment Snapshots
Situational awareness is needed in order to properly assess risks. Posting a flyer that says “Maintain Situational Awareness” is a lower level risk control that has limited effect on reducing the level of risk in a system. One example is a poster that depicts numerous airports that have complex runway and taxiway layouts. Perhaps a better long term approach would be to design user-friendly airports and work collaboratively to eliminate the identified hazards at complex airports. The FAA Runway Safety Program has more information on this challenge at http://www.faa.gov/airports/runway_safety

Factoids
Automatic Dependent Surveillance – Broadcast (ADS-B) services that provide flight and weather information services to enhance pilots’ situational awareness over portions of the Gulf of Mexico were activated in December of 2009.

Step 3. Make risk decisions
Case Study 3: On November 26, 2008, at approximately 1930 Coordinated Universal Time (UTC), a Boeing 777-200ER, registration N862DA, serial number 29734, operated by Delta Air Lines as Flight 18, experienced an uncommanded rollback of the right hand (number 2) Rolls Royce Trent 895 engine during cruise flight at FL390 (approximately 39,000 feet). The flight was a regularly scheduled flight from Pudong Airport, Shanghai, China to Atlanta-Hartsfield International Airport, Atlanta, Georgia. Initial data indicates that following the rollback, the crew descended to FL310 and executed applicable flight manual procedures. The engine recovered and responded normally thereafter. The flight continued to Atlanta where it landed without further incident. Flight data recorders and other applicable data and components were retrieved from the airplane for testing and evaluation (NTSB Identification DCA09IA014).
The National Transportation Safety Board has not yet determined the probable cause of this accident, which had disturbing similarities to the British Airways crash at Heathrow in January of 2008.
“The flight from Beijing to London (Heathrow) was uneventful and the operation of the engines was normal until the final approach. The aircraft was correctly configured for a landing on Runway 27L and both the autopilot and the autothrottle were engaged. The autothrottles commanded an increase in thrust from both engines and the engines initially responded. However, at a height of about 720 ft the thrust of the right engine reduced to approximately 1.03 EPR (Engine Pressure Ratio); some seven seconds later the thrust on the left engine reduced to approximately 1.02 EPR. The reduction in thrust on both engines was the result of less than commanded fuel flows and all engine parameters after the thrust reduction were consistent with this. Parameters recorded on the Quick Access Recorder (QAR), Flight Data Recorder (FDR) and Non‑Volatile Memory (NVM) from the Electronic Engine Controllers (EECs) indicate that the engine control system detected the reduced fuel flows and commanded the Fuel Metering Valves (FMVs) to open fully. The FMVs responded to this command and opened fully but with no appreciable change in the fuel flow to either engine.
The aircraft had previously operated a flight on 14 January 2008 from Heathrow to Shanghai, with the return flight arriving on 15 January 2008. The aircraft was on the ground at Heathrow for 20 hours before the departure to Beijing on the 16 January 2008. Prior to these flights G‑YMMM had been in maintenance for two days, during which the left engine EEC was replaced and left engine ground runs carried out” (AAIB Interim Report G-YMMM 2008).

There was opportunity for risk decisions at many levels and by many different system agents for these events. Flight crews made time critical decisions and responded admirably in both events to prevent loss of life and minimize damage to aircraft and property. The survival of all passengers and crew onboard G-YMMM is a testament to the effort put into crashworthiness design by the industry, and to the skill and experience of the crew. As a result of the Heathrow crash, interim procedures were updated and emphasized that dealt with management of cold fuel and potential lack of engine response. The crew on Delta Flight 18 executed the applicable interim cold fuel procedures and still encountered the rollback condition on one engine, and they then executed the appropriate engine response non-normal checklist, restoring fuel flow to the engine back to a normal condition. Analysis of the risks associated with flying at high altitudes and cold temperatures had been ongoing for years, and recent advances in aircraft fuel and propulsion systems are pushing the envelope, incrementally increasing exposure to the risks associated with this type of flying. This is not a new hazard; it has been around since high altitude flights by strategic bombers experienced problems with fuel filter screens being clogged by ice in the fuel system. The fuel system was redesigned to be more tolerant of the ice and variations in fuel chemistry were explored, but there are still hard limits that remain and cold fuel properties that must be considered when designing systems, planning routes and operating the aircraft. The 777 with Rolls Royce Trent 895 engines has operated on millions of flights with only a few recorded events of cold fuel induced rollbacks, but there was clearly a hazard that had potential catastrophic consequences. Given this information, it was clearly time to make a decision on the level and types of controls that needed to be improved or created to reduce the risks associated with these hazards to the appropriate level.

As an immediate short-term control, flight crews were educated on the recent events and cold fuel management procedures were republished. During inspections of fuel tanks after long flights it was discovered that ice could remain in the system for hours and would not be removed through normal fuel sumping procedures. Maintenance and ramp personnel adopted procedures to remove as much water from the system as possible through amended fuel sumping procedures and by transferring cold wing tank fuel to the warmer center tank, which helped speed the melting of ice once the airplane was on the ground. The center tank always remains warmer because it is located next to the warm cabin, while the wing tank fuel is exposed for long periods to significantly colder skin temperatures. As part of the investigation, test rigs were built to closely examine the interface between cold fuel, water and the aircraft fuel system components. During the course of this investigation it was hypothesized that restrictions to fuel flow were occurring at the inlet to the fuel-oil heat exchanger (FOHE), but the conditions of temperature, fuel flow and “sticky fuel” that led to excess amounts of ice forming and contaminating the system at predictable times remained elusive. Restriction of fuel at the FOHE is problematic, as this creates a situation where there is potential for a single point failure in the system, with no means of bypass or recovery.
Initially one could argue that if one engine shut down, there was a redundant propulsion system to carry the load, but now there was the Heathrow crash to consider. The hypothesis of fuel flow restrictions at the FOHE could not be conclusively supported, but the testing also did not conclusively rule out problems in this area.

What the investigation had done was discover a hazard in the system that needed to be assessed, and from the assessment by the multinational investigative bodies a decision was made by the regulatory agencies to require a new design and replacement of the FOHEs on all of the affected engines. The FAA issued an Airworthiness Directive and estimated the cost of compliance to be approximately 8 million dollars, but when one considers that against the potential loss of life and property, the amount is insignificant. The scope of the Airworthiness Directive is wide, as these airplanes operate globally on a 24 hour basis, but there were also many other risk decisions made as part of this process, both at the micro and the macro level. The flight crews were tasked with both reactive and proactive real-time decisions on the flight deck, and these risk decisions must be made with consideration of other situational factors present that are precursors to mishaps. Ultra Long Range flying involves fatigue, limited communications, exposure to inhospitable flight regimes over mountainous terrain, polar regions, ocean crossings, and periods of time where suitable divert airfields are 3 or more hours away. Proactive risk decisions must be made by investigative teams on what hazard information must immediately be passed to organizations and regulatory agencies, and generative decisions on risk are essential to ensure not only maintenance of safety systems but also continuous improvement. Throughout this study we can also see the various levels of risk controls: changes made at the system level to design the hazard out, barriers and controls put into place to reduce the amount and characteristics of water present in the system, warnings to operators of the system, and administrative procedures and training. These hazard controls target appropriate levels, cross boundaries, and ensure optimum risk decisions are made at all levels, not just within individual departments.

The key to effective risk decisions is to seek broad participation from all stakeholders, and there are many identified here, so that all creative solutions can be identified and considered by those who make final decisions. Information drives decision making and is best understood when placed in the context where individuals utilize the information and make decisions (Leng 2009). Armed with these various threads of information a multidimensional tapestry can be woven that encases analysis, assessment and decision making. Another key element is to make sure that risk decisions are made at the appropriate level. Problems can arise when risk decisions are made at low levels and workers assume inappropriate or excessive risks for an organization without consulting supervisors and top leaders. In the case of Delta 18, the decision to continue the flight to destination versus a diversion was made in concert with the flight crew, maintenance, dispatch and flight operations managers. All of the personnel involved in this type of decision have the authority to stop the operation, and in the case of aviation the Captain retains the ultimate decision authority for crew and passenger safety. Effective decision making incorporates sound judgment, a common strategy and common goals, based on reliable information. A systematic approach to risk management is part of this strategy: problems are identified and assessed, consequences considered and decisions made.
Good decision making is promoted by team approaches, adequate time to make decisions, philosophy, policy, procedure, training and experience. Once a hazard is detected it is essential to assess its potential effects by considering its influence on all parts of the system’s software, hardware and liveware. These effects will vary depending on factors of time and environment, some of which we can control and some of which we cannot, and it is essential to consider this confluence of situational factors in our risk decisions. Decision making can be hampered by lack of time, inaccurate or unreliable information, production pressures and lack of teamwork. Good decision making is an integral part of effective risk management; without it we are more subject to risks associated with hazards and the threat of poor system performance due to human error (Naval Aviation Schools Command 2008). Deciding on controls is a necessary and important step in the process, and timely decisions lead to effective implementation plans, the next step to be considered in the risk management process.
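
The levels of control noted in the fuel-icing discussion (design the hazard out, barriers, warnings, and administrative procedures and training) form a rough hierarchy, sketched below with examples paraphrased from the case; this is an illustrative summary, not an engineering or regulatory document.

```python
# Hierarchy of risk controls, from most to least effective, mapped to the
# fuel-icing case described above. Entries paraphrase the text for illustration.
controls = [
    ("Design the hazard out",
     "Redesigned fuel-oil heat exchanger required by Airworthiness Directive"),
    ("Barriers and engineering controls",
     "Amended fuel sumping and fuel transfer procedures to limit water and ice"),
    ("Warnings",
     "Alerts to operators about cold fuel rollback events"),
    ("Administrative procedures and training",
     "Republished cold fuel management procedures and crew education"),
]

for level, (name, example) in enumerate(controls, start=1):
    print(f"{level}. {name}: {example}")
```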

Risk Decision Snapshots
Many Helicopter Emergency Medical Services and Offshore Gas Production operators utilize a risk assessment matrix as part of the mission planning process. This deliberative process helps ensure that multiple situational factors are considered in their dynamic context, and appropriate risk decisions can be made before the aircraft ever leave the ground.

Factoids
In January 2005, Era Helicopter adopted a Safety Management System (SMS), which focuses on safety training, evaluation, and communications (Era Helicopters 2010).

Step 4. Implement controls
Case Study 4: In the reporter’s own words and NASA ASRS format: “WE PUSHED BACK FROM THE GATE AND I INSTRUCTED THE FO TO START BOTH ENGS, ANTICIPATING A SHORT TAXI. WE PERFORMED THE AFTER START CHKLIST AND THE FO CALLED FOR TAXI. AS WE STARTED THE TAXI, I CALLED FOR THE TAXI CHKLIST, BUT IMMEDIATELY BECAME CONFUSED ABOUT THE RTE AND QUERIED THE FO TO HELP ME CLEAR UP THE DISCREPANCY. WE DISCUSSED THE RTE AND CONTINUED THE TAXI. WE WERE CLRED TO CROSS RWY 4, AND I ASKED THE FO TO SIT THE FLT ATTENDANTS. HE MADE THE APPROPRIATE PA. WE WERE CLRED FOR TKOF RWY 1, BUT THE FLT ATTENDANT CALL CHIME WASN'T WORKING. I HAD CALLED FOR THE BEFORE TKOF CHKLIST, BUT THIS WAS INTERRUPTED BY THE COMS GLITCH. AFTER AFFIRMING THE FLT ATTENDANTS READY, WE VERBALLY CONFIRMED BEFORE TKOF CHKLIST COMPLETE. ON TKOF, ROTATION AND LIFTOFF WERE SLUGGISH. AT 100-150 FT AS I CONTINUED TO ROTATE, WE GOT THE STICK SHAKER. THE FO NOTICED THE NO FLAP CONDITION AND PLACED THE FLAPS TO 5 DEGS. THE REST OF THE FLT WAS UNEVENTFUL. WE WROTE UP THE TKOF WARNING HORN BUT FOUND THE CIRCUIT BREAKER POPPED AT THE GATE. THE CAUSE OF THIS POTENTIALLY DANGEROUS SIT WAS A BREAKDOWN IN CHKLIST DISCIPLINE ATTRIBUTABLE TO COCKPIT DISTR. THE TAXI CHKLIST WAS INTERRUPTED BY MY TAXI RTE CONFUSION. THE BEFORE TKOF CHKLIST WAS INTERRUPTED BY A FLT ATTENDANT COM PROB. AND FOR SOME REASON, THE TKOF WARNING HORN CIRCUIT BREAKER POPPED, REMOVING THE LAST CHK ON THIS TYPE OF THING. BOTH OF US FEEL OURSELVES TO BE HIGHLY DILIGENT PROFESSIONALS. WE GOT OURSELVES IN A BOX BY ALLOWING OURSELVES TO BE DISTR FROM THE CHKLIST. FROM NOW ON, IF I AM INTERRUPTED WHILE PERFORMING A CHKLIST, I INTEND TO DO THE WHOLE THING OVER AGAIN” (NASA ASRS ACN 658970).

Hazard control is an all hands effort. In this case a flight crew discovered several hazards to both man and machine and reported them via the National Aeronautics and Space Administration (NASA) Aviation Safety Reporting System (ASRS). They were not the only flight crew to discover the potential hazards associated with a flaps up takeoff, but they were fortunate to not suffer the fate of the crew and passengers of Spanair Flight 5022, which crashed on takeoff in Madrid, killing 154 people (Spanair 2010). Some hazards are readily identifiable, easy to correct and can be corrected on the spot; others are more difficult to identify and may be more difficult to control and correct. This is why it is not only important to correct hazards as quickly as possible, but also important to report hazards to agencies that can ensure long-term and widespread controls are put in place throughout the system. There is nothing more distressing than to prevent a mishap in one organization and to subsequently suffer loss of life and property in another organization from similar hazards. Implementing risk decisions is another step in the common strategy to manage risk and one that requires the understanding that is derived from robust risk assessment. The Spanair mishap was not simply a “pilot error” mishap where a crew forgot to set the flaps and slats for takeoff, but rather it was a compound systems failure where failures in multiple systems design, training and operational procedures came together in one horrible moment to create a disaster. How could so many parts of the system fail, and how do we defend against this severe hazard?

We must have a robust process to identify the hazards, and key players in industry are taking an in depth look at crashes and events that have similar markers. Several major air lines noticed threshold exceedances in Flight Operations Quality Assurance (FOQA) data and received Aviation Safety Action Program (ASAP) reports that indicated that procedures and aircraft systems were not as robust as desired in preventing attempts at no flap takeoffs. These teams then went out and searched for additional data within the ASRS database and the Aviation Safety Information Analysis and Sharing (ASIAS) system and found a surprising number of cases, hundreds, in fact, of similar events under similar circumstances. And mind you, these are not the kinds of surprises that help flight operations managers sleep at night. It was time to start over and look at all the factors involved in these near mishaps and disasters, and identify cross cutting factors. What the team found was that there were numerous distractions and concurrent tasks to be managed in a very dynamic taxi phase, just as mentioned previously in the NASA report. Pilots have to interleave calls from maintenance, ATC, flight attendants, the company and other aircraft with checklists, runway crossings and engine starts, a saturated cacophony of communications and tasks that humans are ill-suited to manage. On paper, these communication, navigation and procedural tasks are neatly and rationally scripted, but then the reality of flight operations intrudes and disrupts that scripting. Operational research was conducted with the help of scientists from NASA’s Flight Cognition Laboratory, and a bottom-up approach was taken to write a new script and coordinate task timing at the participating carriers.
What emerged from the research was that concurrent task management in dynamic environments was a very difficult ideal to achieve, and this was not a good environment in which to be conducting critical flight duties such as setting and checking flaps and slats for takeoff. A pattern of perturbations to the flight operations “ideal” appeared during the course of the research:
1. Interruptions and distractions;
2. Tasks that cannot be executed in their normal, practiced sequence;
3. Unanticipated new task demands arise; and
4. Multiple tasks that must be performed correctly.
(Loukopoulos, Dismukes and Barshi 2009, 106).

Researchers presented these findings and a decision was made to reduce the number of tasks being managed during the critical taxi phase, when crews should be focused on external communications and navigation around the airport, and to move as many checklist items and flight deck tasks out of this phase as possible. Items that had previously been checked with a “Taxi Checklist” were moved so that they were checked either before taxiing or during an operational pause before taking the runway for takeoff. It is a very simple strategy: reduce the workload on crews during a busy time, create a focused period where critical flight items are checked, and open a window where crews can focus on prevention of ground collisions and runway incursions, other severe hazards to aviation. Resistance was met along the way, with people concerned that the extra time would lead to longer taxi times and increased fuel burns. These concerns were soon forgotten as FOQA data started coming in showing that crews were no longer taxiing onto the runway or attempting to take off with the flaps and slats not properly set. Flight crews subsequently reported that they liked the improved procedures, and reports of runway incursions also decreased.
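
In outline form, the change amounts to moving items out of the high-workload taxi phase into lower-workload phases, as the sketch below shows with generic item names; it does not reproduce any carrier's actual checklist.

```python
# Illustrative reallocation of checklist items out of the high-workload taxi phase.
# Item names are generic placeholders; no actual carrier checklist is reproduced.
phases = {
    "before_taxi":    ["flight instruments", "takeoff data"],
    "taxi":           ["flaps/slats set", "flight controls checked", "takeoff briefing"],
    "before_takeoff": ["cabin secure"],
}

def move_items(phases, items, source, destination):
    """Move selected checklist items from one phase to another."""
    for item in items:
        phases[source].remove(item)
        phases[destination].append(item)
    return phases

# Shift configuration-critical items out of taxi and into an operational pause
# before taking the runway, leaving taxi for navigation and outside vigilance.
move_items(phases, ["flaps/slats set", "flight controls checked"], "taxi", "before_takeoff")
move_items(phases, ["takeoff briefing"], "taxi", "before_taxi")

for phase, items in phases.items():
    print(phase, "->", items)
```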

Hazard identification should contribute to analysis and assessments that lead to control implementations that will reduce the likelihood of a mishap occurring or reduce its severity. These implementations were proven to be practical in the real world of dynamic flight operations. Implementation decisions should be based on quantitative and qualitative confidence not only in the understanding of the problem and the assessed risk, but also in the control having the desired effect on the system at the appropriate time and place. In this case, taking 30 seconds to conduct a safety critical checklist before taxiing reduced the consequence of and exposure to risks associated with a no flap takeoff. This was a quality decision implemented and verified using quantitative data, one more step in a common strategy to manage risk.

Decision Implementation Snapshot
An aircraft battery charger was left attached and unattended on the hangar deck of the Vintage Flying Museum in Ft Worth, Texas. A volunteer noticed the charger and identified it as a hazard to the Museum Safety Council. Subject matter experts in aircraft maintenance, firefighting and safety program management on the volunteer staff were available and assessed the risks to personnel and property. While the frequency of reported battery charger fires was difficult to ascertain, they were known to have happened throughout industry. This, coupled with the potential catastrophic loss of irreplaceable vintage aircraft, led to the Council’s recommendation to discontinue battery charging in the hangar unless personnel were present to monitor the charger and maintain a fire watch. The recommendations were presented to the Museum’s Directors and approved, and a record of the risk management process will be kept as part of the Museum’s Safety Management System.

Factoids
The entire risk management process at the Vintage Flying Museum took approximately 10 minutes, cost nothing (we’re volunteers, remember) and helped ensure the preservation of one of the world’s oldest flying bombers, the B-17 “Chuckie.” (Vintage Flying Museum 2010).

Step 5. Supervise and watch for changes
Case Study 5: On February 12, 2009, about 2217 eastern standard time (EST), a Colgan Air Inc., Bombardier Dash 8-Q400, N200WQ, d.b.a. Continental Connection flight 3407, crashed during an instrument approach to runway 23 at the Buffalo-Niagara International Airport (BUF), Buffalo, New York. The flight was a Code of Federal Regulations (CFR) Part 121 scheduled passenger flight from Liberty International Airport (EWR), Newark, New Jersey to BUF. The crash site was approximately 5 nautical miles northeast of the airport in Clarence Center, New York, and mostly confined to one residential house. The four flight crew and 45 passengers were fatally injured and the aircraft was destroyed by impact forces and post crash fire. There was one ground fatality. Night visual meteorological conditions prevailed at the time of the accident. (NTSB Identification DCA09MA027).

Operations must be supervised at all levels, and we should look not only for changes in the system but also for areas where the system is not performing at the desired level, areas where change is not for the good.
Much has yet to be determined and even more yet to be written about this devastating mishap. Too often in discussions of SMS the focus is on the analytical and scientific aspects, but we should pause to reflect on the fact that when we discuss systems, what we are really talking about is people and their livelihoods. The real world is made up of flight crews, passengers and citizens who view safety of the system as an imperative, not a dot on a scatter diagram. With this in mind we should always focus on improvement, not blame. Initial media reports were quick to condemn the pilots, in a misguided effort to make sense of the tragedy and restore emotional stability to a weakened psyche. Could it be as simple as pilot error? Those more familiar with the complex dynamics of high risk organizations and the limits of pilot expertise know that there is more to be learned from this mishap, and much needed improvements to the system have yet to be fully realized.

What we have learned from the Colgan mishap so far is that a functioning safety culture is important to an organization, as it forms a foundation for organizational learning and sharing of information. We have also learned that we do not yet understand the chronic and acute effects of fatigue, nor have we developed suitable awareness tools to proactively manage schedules and prevent hazardous fatigue scenarios from developing. Systems must be designed to support humans in task accomplishment rather than add to the challenges of instrument flying in adverse weather conditions, and the system failed at multiple stages during the course of this mishap. The “automation age” is maturing in some parts of the industry to the point where concern over maintenance of manual flying skills is increasing, while an increasing reliance on automation has left all operators at risk of not fully understanding the various nuances of complex systems or the insidious emergence of hazardous system states.

These are high reliability organizations where performance must be at optimal levels from man, machine and the system. Operators must be highly skilled and experienced in order to make intervention decisions at opportune moments, but this intervention strategy should not serve as a long term defense against organizational, regulatory and manufacturing deficiencies (Reason 1997). An integrated defense should be developed that is human-centered but at the same time capitalizes on the strengths of automated systems. This is a realm where it would be wise to expand the team and employ the expertise of cognitive engineers. Cognitive engineers are trained in methods that can be brought to bear on solving complex human-centered problems; they understand the rules of scientific observation and, most importantly, know how to measure whether an “implemented solution was indeed a solution” (Cooke and Durso 2008, 5). The importance of this last step of monitoring the system and gathering data to assess outcomes cannot be overstated. It is not easy to do, and good examples of complex problem solving in tightly coupled systems are hard to find.

Risk management strategies are closely tied to the safety assurance process, and we must check on the system to identify residual hazards and initiate corrective actions to improve performance. This is part of the looping process in which risk controls are monitored as part of system operations and information is fed back into improvements to future and existing system designs. During this process we might find ourselves asking whether the process itself is working as desired, especially if we consistently identify the same cross-cutting factors in multiple mishaps. During this monitoring step all agents in the system should be looking for concurrent task management and workload issues, situations requiring rapid response, plan continuation bias, equipment failures or design flaws, misleading or absent cues, inadequate knowledge or experience provided by training or guidance, and hidden weaknesses in defenses against error. These factors were identified during a study of mishaps from the NTSB database and represent “patterns of interaction that appear repeatedly” and “underlie many of the errors identified by the NTSB” (Dismukes, Berman and Loukopoulos 2007, 296). Mishaps involving loss of control have been at the top of the list since the dawn of aviation, and while this may just be the nature of aviation, it is necessary to always seek technological, training and procedural improvements in this area. Supervisors monitoring this system will seek changes to the design of training programs, flight time and duty time regulations, and improvements to aircraft design, as well as advancement of the foundational principles of safety management systems. We must debrief both strengths and weaknesses, identify opportunities and threats, understand the nature of human error and take appropriate actions to ensure proper performance (Reason 1990).

Proper management of safety information is a supervisory function, and each SMS shall include a safety promotion program. This program should ensure proper management of safety information within the organization, provide education on how to identify, report and correct hazards at all levels, and shall include the following:

1. Collection and dissemination of safety information - The collection function includes procedures to ensure proper receipt and care of safety message traffic and other safety correspondence, safety publications and safety films/materials. In this case investigators past, present and future will consider the standards of how information is handled in premier organizations and the challenges for those striving to reach higher levels of safety and performance. Supervisors are the messengers, and the message must be clear in word and deed that operational safety is the priority over all other organizational objectives.

2. Dissemination of information on all facets of safety education and training; procedures for distribution of safety message traffic and other safety correspondence/material; distribution of safety periodicals and publications; participation in safety conferences, symposia, committees and councils; liaison with subordinate, adjacent and senior commands to exchange safety information; attendance at meetings for safety briefings, lectures and viewing of related films; and training in safety related subjects.

3. Control of safety information - The proper control of certain safety information is critical to the success of any safety management system, and the proper distribution, handling, use, retention and release of such information as prescribed by laws and regulations is a requirement for regulators, organizations and individuals who operate within the system.

Supervisors should conduct periodic safety surveys to measure the organization’s safety posture. These may be in-house surveys in which the organization’s own personnel conduct the survey, or external services provided by a regulator, audit team or other subject matter experts. The recommended frequency is every two years; a valid survey functions as part of the supervisory and assurance process and works in concert with other reporting systems. Supervision and monitoring of the risk management process completes the loop and encourages future participation in the system, especially when reporters are given feedback and see generative improvements to the system.

Supervision Snapshots:
Turbulence injuries to passengers and flight attendants account for the highest number of reportable accidents to the NTSB every year. Supervisors in flight operations, meteorology, dispatch, and customer service monitor weather conditions and make both strategic and tactical adjustments to flight patterns to avoid areas of known severe turbulence. Flight crews also coordinate in-flight activities to ensure passengers and crew are protected when transiting hazardous weather areas.

Factoids
Ninety-eight percent of turbulence injuries are suffered by passengers who are not wearing seat belts and by flight attendants who are moving around the cabin conducting flight duties. The other 2 percent of injuries are primarily caused by unrestrained passengers striking those who are properly restrained.

Future Opportunities for Risk Management
“SMS for SMS” – How do we ensure that an SMS does what it is supposed to do? The NTSB is beginning to look at the components of SMS during the course of mishap investigations, and increasing focus will be placed on the process itself as well as on failures of its various components. This is not really new, but we need to be sure that the process supports the intent of our evolving safety philosophy. We must scan the environment, develop standards and share recommended best practices to optimize continuous improvement of SMS. In order to do this we must increase the sharing and management of privileged safety information.

Dynamic Risk Management – Organizations are getting better at managing risks in static environments, but there are still plenty of opportunities to proactively and predictively manage dynamic risk as it emerges from organic changes to systems. Risk is like energy: it does not disappear, but rather changes form and moves from one business area to another. Financial risk can easily be translated into risk of physical harm and loss of property, and the factors that help define these interactions already exist; they just need to be captured, stored as information and managed at the appropriate time. Management of corporate knowledge needs to incorporate the common strategies of risk management, with initial emphasis on system design and analysis. If we can’t define the system or the process, we have very little chance of controlling it.

“If you can’t describe what you’re doing as a process, you don’t know what you’re doing.”
-W. Edwards Deming

This also leaves us exposed to the hazard of wasting valuable time, money and brainpower on reinventing the wheel or recreating a mishap. Think about instances in your organization where you have seen this happen and work towards developing information systems that will help in this endeavor.

Conclusion
The most effective safety enhancements have historically come from the investigative process and lessons learned.

“Those who cannot remember the past are condemned to repeat it.”
-Philosopher George Santayana

Our chosen field is inherently a collaborative one. We succeed by sharing, whether it is online cataloging, database development, reference networking, or the development of standards and recommended practices. Your colleagues are your best resource, both now and in years to come. These partnerships may be developed with the safety manager at a local helicopter manufacturer, volunteer air safety advocates, fellow employees, the Director of Safety at a large helicopter company, airport personnel and government regulators. Remember that risk management as well as SMS is scalable and should be adapted as necessary to address the nature of each unique problem. Whether the discussion revolves around a malfunctioning aircraft towbar, a tower cab display design, a radio, an airplane, a procedural manual, an air traffic control system, governmental regulation or international law, working together with other agents in the system is the key to effective management of risks.

TERMS AND DEFINITIONS
Control. A mechanism that manages a risk. Risk control options for each identified hazard generally fall into the following categories: engineering (e.g., design, tactics, weapons, personnel/material selection, etc.), administrative (e.g., instructions, SOPs, LOIs, ROE, SPINS, etc.), or personal protective equipment (PPE) (e.g., eye protection, ear protection, body armor, etc.). Some administrative control option methods are accept (i.e., accept the hazard's risk), reject (i.e., do not accept the hazard's risk), avoid (i.e., minimize exposure/effects by a different pathway), delay (i.e., postpone until another time when risk is less), benchmark (i.e., utilize a control from another entity; reinventing the wheel is not necessary), transfer (i.e., move the hazard to another participant/asset), spread (i.e., diminish the hazard's risk by distributing it among multiple participants/assets), compensate (i.e., counterbalance the hazard with something that negates its effect), and/or reduce (i.e., limit the exposure to a particular hazard).

Cumulative probability. Summation of probabilities of all causation factors and their impact on participants (e.g., the more participants exposed to a hazard, the greater the cumulative probability of that hazard leading to a consequential error).
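
The definition above does not prescribe a specific formula, so the following is a rough sketch only: assuming each of n participants is independently exposed with the same per-exposure probability p, the chance that at least one exposure leads to a consequential error is 1 - (1 - p)^n, and the simple sum n*p is an upper-bound approximation when p is small. The probability value used below is illustrative, not from the source.

# Sketch of cumulative probability across multiple exposed participants, assuming
# independent exposures with the same per-exposure probability p. The simple sum
# of the individual probabilities is only a rough upper bound for small p.

def cumulative_probability(p: float, n: int) -> float:
    """Probability that at least one of n independent exposures leads to the event."""
    return 1.0 - (1.0 - p) ** n

per_exposure_probability = 0.02  # illustrative value only
for participants in (1, 5, 20):
    exact = cumulative_probability(per_exposure_probability, participants)
    approx = min(1.0, participants * per_exposure_probability)
    print(f"{participants:2d} participants: P(at least one error) = {exact:.3f}, "
          f"sum approximation = {approx:.3f}")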

Deliberate risk assessment/ORM. An application of all five steps of the ORM process during planning where hazards are identified and assessed for risks, risk control decisions are made, residual risks are determined, and resources are prioritized based on residual risk. Usually the risk determination process involves the use of a risk assessment matrix or other means to assign a relative expression of risk.

Hazard. A condition with the potential to cause personal injury or death, property damage, or mission degradation. Also known as a “threat”.

In-Depth risk assessment/ORM. An application of all five steps of the ORM process during planning where time is not generally a factor and an in-depth analysis of the evolution, its hazards and control options is possible. As in the Deliberate ORM process, hazards are identified and assessed for risks, risk control decisions are made, residual risks are determined, and resources are prioritized based on residual risk. Usually the risk determination process involves the use of a risk assessment matrix or other means to assign a relative expression of risk.

Operational analysis. A process to determine the specific and implied tasks of an evolution as well as the specific actions needed to complete the evolution. Ideally, the evolution should be broken down into distinct manageable segments based on either their time sequence (i.e., discrete steps from beginning to end) or functional area (e.g., ASW, ASUW, AAW).

Operator. An individual who has the operational experience, technical expertise, and/or capability to accomplish one or more of the specific or implied tasks of an evolution.

ORM. Operational risk management.

Residual risk. An expression of loss in terms of probability and severity after control measures are applied. Simply put, this is the hazard's post-control expression of risk (i.e., RAC or other expression of risk).

Risk assessment. A process to determine risk for a hazard based on its possible loss in terms of probability and severity. A hazard's severity should be determined from its impact on mission, people, and things (i.e., material, facilities, and environment). A hazard's probability should be determined from the cumulative probability of all causation factors (e.g., more assets involved may increase overall exposure to a particular hazard). Ideally, experiential data (i.e., hazard/mishap statistics) should be utilized during the hazard assessment process to assist in determining hazard probability.


Risk Assessment Code (RAC). A numerical expression of relative risk (e.g., RAC 1 = critical risk/threat, RAC 2 = serious risk/threat, RAC 3 = moderate risk/threat, RAC 4 = minor risk/threat, and RAC 5 = negligible risk/threat). See risk assessment matrix.

Risk assessment matrix. A tool used to determine a relative expression of risk for a hazard by means of a matrix based on its severity and probability. Typically, a numerical risk assessment code (RAC) is assigned to each hazard to represent its relative risk (i.e., 1 = critical, 2 = serious, 3 = moderate, 4 = minor, and 5 = negligible). For example:


RISK ASSESSMENT MATRIX

                                       PROBABILITY
                              A         B         C         D
SEVERITY                      Likely    Probable  May       Unlikely
I    Death, Loss of Asset     1         1         2         3
II   Severe Injury, Damage    1         2         3         4
III  Minor Injury, Damage     2         3         4         5
IV   Minimal Threat           3         4         5         5

Probability
A – Likely to occur immediately or within a short period of time.
B – Probably will occur in time.
C – May occur in time.
D – Unlikely to occur.
Severity
I – May cause death, loss of facility/asset, mission failure.
II – May cause severe injury, illness, property damage, severe mission degradation.
III – May cause minor injury, illness, property damage, minor mission degradation.
IV – Minimal threat, no mission degradation.
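
As an illustration of how the matrix is applied, the sketch below encodes the table above as a simple lookup from severity category (I-IV) and probability category (A-D) to a Risk Assessment Code; the dictionary and function names are mine, introduced for illustration only.

# Direct encoding of the risk assessment matrix above: a hazard's severity
# category (I-IV) and probability category (A-D) map to a RAC from 1 to 5.

RAC_MATRIX = {
    "I":   {"A": 1, "B": 1, "C": 2, "D": 3},  # death, loss of asset
    "II":  {"A": 1, "B": 2, "C": 3, "D": 4},  # severe injury, damage
    "III": {"A": 2, "B": 3, "C": 4, "D": 5},  # minor injury, damage
    "IV":  {"A": 3, "B": 4, "C": 5, "D": 5},  # minimal threat
}

RAC_LABELS = {1: "critical", 2: "serious", 3: "moderate", 4: "minor", 5: "negligible"}

def risk_assessment_code(severity: str, probability: str) -> int:
    """Look up the RAC for a hazard from its severity and probability categories."""
    return RAC_MATRIX[severity.upper()][probability.upper()]

# Example: a hazard that may cause severe injury (II) and probably will occur (B)
rac = risk_assessment_code("II", "B")
print(f"RAC {rac} ({RAC_LABELS[rac]} risk)")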





Risk decision. A determination of which risk controls to implement to mitigate or manage a particular risk. During the risk decision process, each risk control option's effects should be considered before passing recommendations to the appropriate level for making risk decisions. A risk control option's effects should be determined from its impact on the probability of the hazard, its impact on the severity of the hazard, the cost of the control (i.e., what is being sacrificed), and how it works with other controls (i.e., impedance vs. reinforcement). When cost effective, multiple risk control options (i.e., layered or overlapping controls) should be considered. Risk control options should be chosen to enhance their impact on either probability and/or severity (e.g., goggle use impacts both the probability and severity of eye injury) and chosen in the most mission supportive combination (i.e., when one set of controls is more supportive of the mission than another set with the same effect, choose the controls that support the mission).
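
To illustrate the kind of comparison described above, here is a small hypothetical sketch that ranks candidate control options first by expected residual RAC and then by cost. The options, residual codes and cost figures are invented for illustration; a real decision would also weigh mission impact and interactions between controls.

# Hypothetical sketch of weighing risk control options during the risk decision
# step. Options, residual RACs and costs are invented; real decisions also weigh
# mission impact and how controls reinforce or impede one another.
from dataclasses import dataclass

@dataclass
class ControlOption:
    name: str
    residual_rac: int  # expected RAC after the control is applied (1 = critical ... 5 = negligible)
    cost: float        # relative cost of implementing the control

options = [
    ControlOption("Accept the hazard as-is", residual_rac=2, cost=0.0),
    ControlOption("Administrative control (revised checklist)", residual_rac=3, cost=1.0),
    ControlOption("Engineering control (configuration warning)", residual_rac=4, cost=10.0),
    ControlOption("Layered: checklist plus warning system", residual_rac=5, cost=11.0),
]

# Rank options: lowest residual risk first (highest RAC number), then lowest cost.
for option in sorted(options, key=lambda o: (-o.residual_rac, o.cost)):
    print(f"{option.name:45s} residual RAC {option.residual_rac}, relative cost {option.cost:.1f}")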

Risk. An expression of possible loss in terms of severity and probability.

Time Critical ORM. A risk management process that is limited by time constraints, which precludes using a deliberate or in-depth approach. One exemplar time critical ORM process consists of four steps:
Assess (your situation, your potential for error),
Balance Resources (to prevent and trap errors),
Communicate (risks and intentions), and
Do & Debrief (take action and monitor for change).

Threat. See hazard.


REFERENCES

Army Safety Center 2009. Crew Resource Management. https://safety.army.mil/crm

Budd, John M. 2005. The Changing Academic Library: Operations, Culture and Environment. American Library Association. Chicago

Dekker, Sidney 2008. Just culture: Balancing safety and accountability. Burlington, VT. Ashgate Publishing Company.

Dismukes, R. Key, Benjamin A. Berman and Loukia D. Loukopoulos 2007. The limits of expertise. Rethinking pilot error and the causes of airline accidents. Burlington, VT. Ashgate Publishing Company.

Era Helicopters 2010. Safety. Retrieved on December 20, 2009 from http://www.erahelicopters.com/content/e3/index_eng.html

Evans, G. Edward and Patricia L. Ward 2004. Management basics for information professionals. New York. Neal-Schuman Publishers, Inc.

Federal Aviation Administration (FAA) 2007. Flight Standards SMS Standardization Manual.

Holden, Richard J. 2009. People or Systems. Professional Safety. December 2009: 34-41.

Human Factors in Accident Investigation Manual (HFIAI) 2005. Transportation Safety Institute. Oklahoma City, OK.

International Civil Aviation Organization (ICAO) 2010. Annex 13: Aircraft Accident and Incident Investigation.

International Civil Aviation Organization (ICAO) 2010. Safety Management Manual. Retrieved on January 17, 2010 at http://www.icao.int/anb/SafetyManagement/Documents.html

Jeng, Ling Hwey 2009. Texas Woman’s University, School of Library and Information Studies: Welcome from the Director. Retrieved December 12, 2009 from https://www.twu.edu/library-studies/welcome.asp

Loukopoulos, Loukia D., R. Key Dismukes, and Immanuel Barshi 2009. The multitasking myth: Handling complexity in real-world operations. Burlington, VT. Ashgate Publishing Company.

National Aeronautics and Space Administration 2010. Aviation Safety Reporting System Database Online. Retrieved January 16, 2010 at http://akama.arc.nasa.gov/ASRSDBOnline/QueryWizard_Filter.aspx

National Transportation Safety Board (NTSB) 2008. Aircraft Accident Report NTSB/AAR-08-02. NTSB Identification: DCA07FA037. Thursday, April 12, 2007 in Traverse City, MI, Bombardier CL-600-2B19, registration N8905F. Retrieved December 15, 2009 from http://www.ntsb.gov/ntsb/brief.asp?ev_id=20070501X00494&key=1

National Transportation Safety Board (NTSB) 2008. Uncommanded rollback of the right hand (number 2) Rolls Royce Trent 895 engine during cruise flight. NTSB Identification: DCA09IA014. Wednesday, November 26, 2008 in Atlanta, GA. Boeing 777, registration N862DA. Retrieved January 16, 2009 from http://www.ntsb.gov/ntsb/brief.asp?ev_id=20081201X44308&key=1

National Transportation Safety Board (NTSB) 2009. Crash of Colgan Air Bombardier Dash 8-Q400 during instrument approach to Buffalo Niagara International Airport. NTSB Identification: DCA09MA027. Thursday, February 12, 2009 in Clarence Center, NY. Bombardier INC DHC-8-402, registration N200WQ. Retrieved December 17, 2009 from http://www.ntsb.gov/ntsb/brief.asp?ev_id=20090213X13613&key=1

Naval Aviation Schools Command 2009. Decision Making. Retrieved December 8, 2008 from https://www.netc.navy.mil/nascweb/crm/standmat/seven_skills/DM.htm

Naval Postgraduate School 1995. Aviation safety program: Command policy and reporting. Monterey, CA.

Naval Safety Center 2010. DOD Standard Practice for System Safety. Retrieved on January 17, 2010 from http://safetycenter.navy.mil/instructions/osh/milstd882d.pdf

Naval Safety Center 2009. OPNAVINST 3750.6. Naval Aviation Safety Program. Retrieved on December 3, 2009 from http://www.safetycenter.navy.mil/instructions/aviation/opnav3750/index.asp

Reason, James 1990. Human error. New York: Cambridge University Press.

Reason, James 1997. Managing the risks of organizational accidents. Burlington, VT. Ashgate Publishing, Ltd.

Rosekind, M.R. 2005. Managing work schedules: An alertness and safety perspective. In M.H. Kryger, T. Roth, W.C. Dement, editors, Principles and Practice of Sleep Medicine: 682.

Schechtman, Gregory M. 1996. Manipulating the OODA loop: The overlooked role of information resource management in information warfare. Retrieved on January 9, 2010 from www.au.af.mil/au/awc/awcgate/afit/schec_gm.pdf

Shappell, Scott A. and Douglas A. Wiegmann 2003. A Human Error Approach to Aviation Accident Analysis. Burlington, VT. Ashgate.

Simon, Herbert A. 1997. Administrative behavior. New York, The Free Press.

Spanair Flight 5022 2010. http://en.wikipedia.org/wiki/Spanair_Flight_5022

Transportation Safety Institute. http://www.tsi.dot.gov

Vintage Flying Museum 2010. About Vintage Flying Museum. Retrieved on January 16, 2010 from http://www.vintageflyingmuseum.org/

Vintage Flying Museum 2010. Safety Management System. Retrieved on January 7, 2010 at http://groups.google.com/group/ftwasp/web/vintage-flying-museum-safety-management-system


Wallers, James M. and Robert L. Sumwalt 2000. Aircraft accident analysis: Final reports. New York. McGraw-Hill.

About the author:




Kent Lewis is an Air Transport Pilot currently flying with Delta Air Lines in Atlanta, Georgia. He has international flying experience on the Boeing 777 and currently flies the MD-88. Kent graduated with honors from the U.S. Naval Postgraduate School’s Aviation Safety Officer course, was the Director of Safety and Standardization at MCAS Yuma, the largest aviation training facility in the United States, and was previously the Safety Department Head at VT-27, the Navy’s busiest aviation training squadron. His safety programs have been recognized as the best in the Navy/Marine Corps and have a zero mishap rate. He currently volunteers as the Director of Safety at the Vintage Flying Museum in Ft Worth, TX and owns the safety website www.signalcharlie.net. Kent also volunteers as a FAA Safety Team Lead Representative for the Ft Worth FSDO, and he was selected as the 2009 National FAASTeam Representative of the Year. He has attended the Air Line Pilots Association Basic Safety School, Safety Two School, Accident Investigation 2 (AI2), Advanced Accident Investigation course (AI3), FAA SMS Standardization course and FAA Root Cause Analysis Training. Kent's focus is Human Factors and System Safety. Current committee work includes ALPA SMS and Human Factors and the FAA Runway Safety Root Cause Analysis Team, as well as teaching the Human Factors module in ALPA's AI2 course. As the ALPA Atlanta Council 44 Safety Chairman, he represents over 3500 of his fellow pilots on aviation safety matters.

Education: Kent is a graduate of the University of Texas at Austin, with a Bachelor of Arts degree in History. He is currently pursuing a Masters in Library Science at Texas Woman's University.



Flight Instruction: Kent was a flight instructor while in the U.S. Marine Corps, teaching all phases of the flight curriculum to Navy and Marine primary and intermediate students in the T-34C, a fixed wing turboprop trainer. He also flew helicopters and was a Terrain Flight, Search And Rescue and Night Vision Goggle Instructor. In 1998 he was qualified as an Aircrew Coordination Training Facilitator. He gained certification with the FAA as a Certificated Flight Instructor, Instrument Instructor and Multi-engine Instructor in airplanes.

Associations: Kent is a member of the International Society of Air Safety Investigators, Air Line Pilots Association, Aircraft Owners and Pilots Association, Society of Aviation and Flight Educators, Ancient Order of Shellbacks, and the National Association of Flight Instructors.






Contact info: lewis.kent@gmail.com