
9. Other Testing Considerations

9.1 Overview

This chapter provides other helpful and practical testing considerations that did not fit elsewhere in the basic structure of this document. They have been included here because they are important and provide specific guidance and recommendations that should be useful in your test program.

9.2 Meaning of Shall, Will, May, Should and Must in Requirements Statements

The language, syntax, and structure of your requirements statements are extremely important because they directly affect the quality and thoroughness of the test program that is based on them. Certain terms used in requirements statements have specific contractual implications.37

"Shall" is used to confer a requirement on the provider of the product or service and is typically understood to mean at the time of delivery.

"Will" is used to confer a requirement on the receiver (the accepting agency) of the product or service when that product or service is delivered. "Will" is also used to imply a future requirement on the provider that should be clear from the context of the statement.

"May" is a conditional term and implies that either the provider or the receiver has the option to meet the stated requirement. "May" statements are generally not testable unless additional conditions are included to indicate what is expected if the provider or receiver elects that option.

"Should" falls into same category as "may" and is considered an optional requirement that may or may not be included in the system depending on the provider's perspective.

"Must" is used to add additional emphasis to the requirement statement that can be directed at either the provider or receiver, but more typically the provider. "Must" is typically used in a requirement that has specific legal or contractual ramifications such as may be invoked by requiring a particular State statute or governing regulation be strictly adhered to in the performance of the work. In the contract specifications, it has the same basic meaning as "shall."

From a contracting perspective, only requirements with MUST and SHALL statements are likely to be provided by the contractor. All other requirements should be considered part of a "wish list" and will not be part of the testing program.

9.3 How To Write Testable Requirements - Do's and Don'ts

The requirements statements contained within the procurement specifications are the basis for test and acceptance. Poor or missing requirements statements result in requirements that cannot be verified and products or services that don't meet expectations. Requirements statements should be written as clear, unambiguous, declarative sentences. Proper grammatical sentence structure is as important as the use of "shall" and "must," particularly in defining who is the responsible party for providing the product or service and who will be accepting delivery of that product or service. The following are some do's and don'ts for writing and reviewing testable requirements.

Do's

Write the requirement in simple, understandable, concise terms; be short and to the point. If complex technical terminology is necessary, make sure those terms are defined or well understood by the provider as well as the receiver.

For each [individual] requirement there should be one shall or must statement. If the requirements are complex, then they should be subdivided into a string of individual requirements to the greatest extent possible. A test case will be generated to verify each "shall."

Write the requirement as a positive statement. If something is not desired, try to phrase the requirement to state what is desired. However, this is not an absolute; if the system is not supposed to allow expired passwords to be used, then an explicit "shall" statement with that requirement should be included. For example, "The system shall reject and log each user logon attempt that uses an expired password."

Have a test method, such as inspection, certificate of compliance, analysis, demonstration, or test, and pass/fail criteria in mind when writing or reviewing the requirement. If you can't figure out how to verify the requirement or what criteria constitute acceptance, you can't expect the provider to demonstrate compliance with the requirement. This approach may cause the writer to re-write the requirements with testing in mind.

Consider the complexity, technical expertise, and expense of the testing that may be necessary to verify the requirement; simplifying the requirement may result in the same end product or service, but at a reduced test expense.

When preparing the requirements statement, be careful what frame of reference is used for the reader. As noted earlier, software developers and traffic engineers have entirely different frames of reference. What may seem clear to the traffic engineer may become mangled when interpreted by a software developer! As the requirements are prepared, make sure that the requirements will have the same interpretation regardless of the background and experience of the reader. Add clarifying information when necessary to ensure a common understanding by readers with radically different backgrounds.

Don'ts

Avoid the use of "may" and "should" in the requirement statement unless you specifically want to give the provider an option in how that requirement can be met, or give the receiver an acceptance option or an "out."

Avoid negative requirements. For example, the statement "tightening torque shall not exceed forty foot-pounds" implies that anything less than forty foot-pounds would be acceptable. If the requirement applies to the torque applied when tightening a fastener, a positive statement such as "shall be tightened to 35 foot-pounds +/- 4 foot-pounds" is much better because it defines the minimum as well as the maximum torque to be applied and can be definitively measured for acceptance testing.

Don't mix dissimilar or unrelated requirements in the same statement. This practice complicates requirements traceability and verification testing. Unrelated requirements will usually be verified at different times, under different test conditions, and using different test procedures or methods.

9.4 Test Pass/Fail Criteria

Test procedures should detail each test step. They must indicate what requirement (or partial requirement) is being verified; what action, event, or condition must occur to execute or complete the test step; and what is the expected outcome or response to that action. The expected result (outcome or response) is the pass/fail criteria for that step. If the test step was executed and the expected result did occur and was either witnessed or recorded and can be confirmed, then that test step can be considered to have passed, and the test step's requirement is verified. If the test step could not be executed or was executed and the expected result did not occur, was not witnessed, or cannot be confirmed from the record, then that test step must be considered to have failed and that requirement not verified. Any outcome other than the expected one should be considered an anomaly or error (i.e., failed).
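The pass/fail rule described above can be captured in a simple record for each test step. The following Python sketch is illustrative only; the field names, the requirement identifier, and the logon example are hypothetical and are not taken from any particular test procedure.

    from dataclasses import dataclass

    @dataclass
    class TestStep:
        # One step of a test procedure (illustrative field names).
        requirement_id: str       # requirement (or partial requirement) being verified
        action: str               # action, event, or condition that executes the step
        expected_result: str      # the expected outcome, i.e., the pass/fail criterion
        actual_result: str = ""   # what was observed or recorded during the test
        confirmed: bool = False   # witnessed or recorded and confirmable

        def passed(self) -> bool:
            # A step passes only if it was executed, the expected result occurred,
            # and the result was witnessed or recorded; any other outcome is a failure.
            return self.confirmed and self.actual_result == self.expected_result

    # Hypothetical usage: the expired-password requirement from section 9.3
    step = TestStep(
        requirement_id="REQ-112",
        action="Attempt logon with an expired password",
        expected_result="Logon rejected and attempt logged",
    )
    step.actual_result = "Logon rejected and attempt logged"
    step.confirmed = True
    print(step.passed())  # True: the step passed and its requirement is verified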

Beware of transient conditions. Testing is an important aspect of system acceptance and everything that happens during the test counts. Hence, all test participants must be focused on the testing operation. With today's complex systems, it is not unusual for "strange" things to happen that are not repeatable. For example, as the operator is performing a test, a specific screen appears to show a non-existent error which does not re-appear when the screen is refreshed. Was this an error or anomaly? At this point, the system might be considered suspect and the tester may want to repeat the step (to confirm the final condition). Be sure to log this type of event and file a report. Although it may not re-appear during the testing, it may provide a clue to some other unrelated problem experienced later. At the very least, the vendor should be required to explain what and how it could have occurred.

9.5 Test Reporting

Test reporting requires an accurate log of the test configuration, test conditions, the requirements that were verified, specification of the pass/fail criteria, and identification of the completed test steps. Good test procedures establish what was intended and provide a checklist for tracking the progress of the testing. The test report should summarize the test activities, including test date, time, and location, test witnesses and observers present, exceptions or anomalies noted, and SPCRs written. The test report should include the original copy of the procedure checklist, with test witness-initialed steps, data collected, supporting analyses, and a test completion status and/or re-test recommendation. The test report is usually prepared by the test conductor and submitted to the test director. The accepting agency determines final test completion status from review of the test report.

One approach that may be useful is to construct a large 3-ring binder with the complete test procedure. Then, as each test step is taken that requires inspection, calibration certificates, print-outs, pictures, etc., this data can be added to the book and provide a complete record of what was done and by whom. If one is performing hardware testing, it is advisable to take pictures of the test configuration, test actions, scope traces, and the environment. Such additional information can be invaluable when preparing the final test report and provides further proof of the activities and actions. There are techniques such as using "alt-PrtScn" and "ctrl-PrtScn" to capture screen shots (the active window or the whole screen) that can be used to provide snapshots of the user interaction with the system.
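Screen captures can also be automated so that every image is named for the test step and the time at which it was taken. The sketch below is one possible approach, assuming the Pillow imaging library is available (ImageGrab captures the screen on Windows and macOS); the file-naming convention shown is hypothetical.

    # Minimal sketch of automated screen capture for the test record.
    # Assumes the Pillow library is installed; ImageGrab.grab() captures the full screen.
    from datetime import datetime
    from PIL import ImageGrab

    def capture_screen(step_id: str) -> str:
        # Save a full-screen capture named for the test step and a timestamp.
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"test_{step_id}_{stamp}.png"   # hypothetical naming convention
        ImageGrab.grab().save(filename)
        return filename

    # capture_screen("4_2_3")  # call at each step that requires a screen shot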

It is important that the agency maintain control of the test data collection document or test "book." The agency must be diligent in recording the results. The perspective, while unpleasant to consider, must be to keep records that the agency could use in a court of law to prove or show contractor non-compliance – i.e., test failure. Under worst case scenarios, the agency may be called on to show cause as to why and how the test results show that the contractor did not complete the work as contracted. These test records may be the only record of what happened since both the agency and contractor personnel witnessed the tests and initialed the logs.

9.6 Test Failures and Re-Testing

Tests fail for a variety of reasons, many of which have nothing to do with whether or not the requirements being verified by the test have been met. Examples of test problems that may be encountered resulting in a test failure include:

  1. Poor or inadequate test procedures.
  2. Incorrect test procedure execution (e.g., skipping a test step, executing test steps out of sequence, failure to check off completed test steps, and failure to record test data).
  3. Inappropriate test conditions (e.g., lack of test support personnel, insufficient traffic volume to trigger congestion detection algorithm response, poor visibility due to deteriorating weather conditions, etc.).
  4. Device failures, including the communications infrastructure.
  5. Failure of test equipment.

Many of these situations can be avoided by thoroughly reviewing all test procedures, executing a dry run of the test before the formal test in front of the customer (agency), providing additional on-call test and maintenance support personnel, checking expected test and weather conditions before starting the test, and ensuring the availability of backup test equipment.

Even with these pre-test precautions, however, things happen and tests fail. The procurement specification should allow for re-testing both for cause (i.e., the test failed to verify the requirement due to product design, implementation, or test inadequacy) and for reasons such as those listed above. Where possible and at the direction of the test conductor, the procurement specification should allow the test to be re-started, for example from a point before the skipped step or the steps executed out of sequence, or from a point before the failure of a supporting device not under test or of the test equipment. Alternatively, the test may be repeated from the start (either immediately or within an hour or so), provided the problem is recognized before the test is terminated by the test conductor, the test conditions can be re-set (i.e., error conditions cleared, processes or devices re-initialized, equipment replaced or repaired within a reasonable period of time), and the other necessary test conditions can still be met. Testing is expensive and resources and schedules are finite, so it is to everyone's advantage to complete a test that would otherwise result in a failure if a minor procedural error or test condition can be resolved quickly, allowing the testing to proceed to a successful conclusion. The procurement specification should also allow minor test irregularities to be waived or a partial test to be executed to "clean up" an irregularity. The procurement specification must also state very clearly the agency's process for resolving test conflicts or disputed results.

There may be conflicting interests once a test step has clearly failed due to equipment or software malfunction. If such a failure is discovered during day 2 of a planned 5-day test, does one press on and complete the entire test (where possible) to see if there are other problems? Or does one halt the test until the repair/correction can be made and then restart the testing? If the repair requires a week to complete, what is the proper course of action? The vendor might argue that to continue simply expends time and resources better spent on correcting the problem and preparing for a re-test, while the agency may want to press on to see if there are other issues or problems with the system. The specifications should place this decision with the agency and require that once the test has started it is the judgment of the agency as to whether the test continues after such a failure or is terminated and re-scheduled. Note that in some instances, continued testing may be impossible due to such occurrences as a corrupted database, failure of a server, or failure of some other mission-critical device.

Another issue that must be considered is "how many times" the vendor is allowed to fail the testing before terminating the contract or forcing some other drastic action. While such conditions are not anticipated or expected to occur, project specifications need to address possible scenarios and place limits on the number of re-tests allowed and the timing of such re-testing. There are also issues of who bears the cost of such re-testing as well as how many re-tests are allowed. While it should be clear from the contract that the contractor is responsible for the test environment and all costs associated with performing the tests (including laboratory personnel, test equipment, consumables, utilities, space, etc.), there may be a cost to the agency for consultant services to observe and monitor the testing as well as possible travel costs if the testing is performed at another facility some distance from the agency's offices. Such issues need to be addressed in the contract. Examples include a requirement that the contractor prepay all travel expenses (using government per diem allowances) for all testing (which tends to place distant vendors at a financial disadvantage), or a limit of 3 "free" testing attempts. In some instances, the contractor may be required to reimburse the agency for the expense of its consultants for retesting after a limited number of "free" attempts. How this is dealt with will depend on the number of units involved, the contract provisions, and the agency's procurement policies and procedures.

9.7 Testing Timeframes

Defining the timeframes for testing is a critical function of the procurement specification. Be prepared to address test interruptions by planning for them early and defining their handling.

A test's timeframe should be defined in the procurement specification. Usually this timeframe is set in terms of contiguous calendar days. This terminology is important to set a maximum time for conducting the tests and to avoid terminology that allows contractors to interrupt or suspend tests in order to make corrections and then resume the tests.

When defining operational tests of longer durations (30-90 days), the procurement specification must be realistic about the probability that external forces will impact system operations. For example, outages with leased communications facilities will impact overall system operation but are beyond the control of the contractor. Also, field facilities may be damaged by vehicle knock-downs of equipment. It is not realistic to hold a contractor liable for events beyond their control. There will be small maintenance issues that occur, but these need to be put into perspective and dealt with without impacting the operational test program. For example, the operational test should not be failed because a report did not print when the printer was out of paper, but the test should be concerned about a loss of communications due to the failure of a power supply in the custom central communications equipment.

One also needs to be realistic in understanding that equipment does fail and that during a 30 or 60-day "observation" period, it is possible that of 250 field devices, one or more may experience a failure or anomaly. Restarting such an observation period at day one for each such failure will almost guarantee that the test will never be completed. While agencies may see this as an extension of their warranty, such expectations are unrealistic. Instead, the agency should establish a failure management and response approach that recognizes this possibility and establishes criteria for determining that the test has failed and must be re-started vs. continued without suspension, or suspended and continued once the failure has been corrected. Factors such as the severity of the failure and the time to repair should be incorporated into the decision. For example, if during the period of observation, a DMS panel experiences a failure, the contractor might be allowed 48 hours to correct the problem without affecting the test; however, if more than 3 signs (of 25) experience such a failure within the 60 day test, the test may be considered to have failed and must be restarted. The decision should be based on whether there appears to be a symptomatic problem or a random failure of the device. For system software this may become more problematic. For example, if there is a "memory leak" that seems to be causing the system to crash and need to be re-booted about once per week, does one continue to move on, suspend, or terminate? If the problem can be quickly diagnosed and repaired, a restart is probably in order, but if the problem appears half way into the test, what is the best approach? Should this be noted and corrected under the system warranty? Or, should the test be halted, the problem corrected, and the test restarted?
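Criteria like those in the DMS example above are easier to apply consistently during the test if they are written down in an unambiguous, mechanical form. The sketch below is a minimal illustration using the 48-hour repair allowance and the 3-of-25-signs limit from the example; the thresholds and decision labels are illustrative, not recommended values.

    # Illustrative failure-management criteria for a 60-day observation test,
    # using the DMS example above (48-hour repair allowance, re-start if more
    # than 3 of 25 signs fail during the test period).
    MAX_REPAIR_HOURS = 48
    MAX_FAILED_SIGNS = 3

    def observation_test_decision(failures):
        # failures: list of (sign_id, repair_hours) tuples logged during the test.
        if len({sign_id for sign_id, _ in failures}) > MAX_FAILED_SIGNS:
            return "re-start"   # too many signs affected; a symptomatic problem is suspected
        if any(hours > MAX_REPAIR_HOURS for _, hours in failures):
            return "suspend"    # repair exceeded the allowance; agency decides how to proceed
        return "continue"       # isolated, random failures corrected within the allowance

    print(observation_test_decision([("DMS-07", 6.0)]))    # continue
    print(observation_test_decision([("DMS-07", 60.0)]))   # suspend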

There are no easy answers to these issues; the agency needs to ensure that their system and devices are reliable, while the project needs to move on to final acceptance so that the contract can be closed out. Be prepared to deal with these issues and be sure that they are addressed in the project specifications.

9.8 Testing Organization Independence from Developers

Testing should be developed and conducted by an organization that is independent of the product development organization. For test procedures, this helps to ensure that the test procedures verify the requirements as stated, not what was assumed to be wanted or needed by the developer, or, for that matter, intended by the acquiring agency. If the requirement is unclear, vague, or ambiguous, the test organization will not be able to develop a test procedure to verify it and will ask that the requirement be revised or rewritten such that it can be verified. This needs to happen early during product design and development, not after the product has been delivered for acceptance testing. Therefore, the test organization needs to start test planning and test procedure development in the requirements definition and analysis phase of the project.

For test execution, test personnel from an independent organization will not overlook or ignore unexpected test results that a developer might. Often the developer will overlook anomalies because he can explain them or knows that those unexpected results are not related to what is being tested (i.e., they could be caused by an unforeseen interaction with another process or device not directly related to the test). If there are problems or unexpected results that occur during the test, they need to be recorded and reported so they can be corrected or resolved before accepting a potentially flawed product.

While this section recommends an "independent" test organization, it is likely that the contractor will handle the testing from test plan generation to test execution as well. Within most organizations, an independent test group will take on this responsibility and this should be permissible as long as the personnel are independent from the developers and designers. Review the contractor's organization chart and determine the degree of independence of the testing group.

9.9 Testing Relevancy and Challenges

Some challenges that must be met in a testing program relate to the relevancy of the test with respect to the actual test conditions and test limitations and constraints. For example, if the stated test condition requires a period of high traffic volume, testing at night or during an off-peak period will probably not achieve the desired test result, compromising the ability to verify the requirement which depended on the existence of that condition for demonstration. Make sure the expected test conditions are relevant for the requirements being verified. For the example cited, one may need to develop calibrated simulators that are installed in the remote cabinets to actually demonstrate that the specific requirements have been met.

Test limitations and constraints must also be considered to ensure that the test is relevant and the test results will demonstrate compliance to the requirements being tested. For example, if the test is limited to the CCTV camera subsystem, it should not have any test steps that verify requirements for the DMS subsystem. However if camera selection for control is accomplished by clicking a mouse pointer on the camera's icon on the GIS map display, requirements for that control action and related GIS display are relevant and should be also verified in the CCTV camera subsystem test. Further, where the GIS display is active, it may be prudent to ensure that all map "layers," which would include the DMS, be shown.

A typical test constraint might limit DMS test messages to a pre-defined fixed set, even though a much larger set of both pre-defined and user-generated messages will ultimately be required and verified in a different test. In this example, the test is limited to a pre-defined set, so the DMS software needed to support the test does not have to be the final version. More precisely, where the final version would typically allow the user to both edit existing messages and create new ones, the test software would only allow the selection of pre-coded messages. Here, the test relevancy has been purposely limited to verifying the ability of the DMS subsystem to access stored messages and display them. This test limitation allows early verification of a critical portion of the DMS requirements while design and development of software to satisfy the complete set of requirements continues. Such a situation might be useful for conducting a 30 or 60 day test message burn-in where later tests will fully verify the central system capabilities.

9.10 Issues Affecting System Reliability

When establishing a burn-in or extended system test, it is important to consider what might happen, how systems fail, and what steps the designers may wish to consider to mitigate the effects of such failures. Criteria for acceptable system performance and the calculations for system reliability are also discussed, again, as background when considering how to manage extended testing.

9.10.1 System Failure Modes and Effects

TMS subsystem failures can result from a large number of different causes, and a particular failure event can have a variety of effects. This section examines some of the more common failure events and the design and operational factors that can mitigate the effects of those failure events. It also examines failures of typical critical communication infrastructure components and addresses their failure modes and effects.

Table 9-1 presents some of the more common events that can cause failures and the factors that can mitigate their occurrence and/or severity. Note that redundant capabilities are listed as mitigating factors for cable plant damage and power outage events only. While it could be argued that some form of redundancy could mitigate the effects of all of these causal events, it would be true only when that redundant capability is geographically separated or provided by different means or methods other than the primary capability. That is, the causal event would not affect both the primary and redundant capability in the same way at the same time. Since this is typically not the case, the mitigating factors become very important and must be considered as well as possible redundancy options when developing the project specifications and requirements.

Table 9-1. Common Failure Causes and Mitigating Factors
Causal Event Mitigating Factors
Lightning
  • Lighting Arrestor
  • Attenuators/Filters
  • Non-Conducting Materials
  • Proper Bonding and Grounding/Isolation
Fire
  • Material Selection
  • Elimination of Ignition Sources
  • Fire Suppressant/Extinguisher
Flood
  • Site Prep/Drainage
  • Equipment Location Enclosures/Seals
  • Alarms
Wind
  • Structure/Support/Strain Relief
  • Mounting
  • Enclosure
Temperature
  • Component Selection
  • Ventilation
  • Insulation
  • Mounting (Expansion/Compression)
  • Heating Ventilation and Air Conditioning
Humidity
  • Component Selection
  • Coatings
  • Enclosures/Seals
  • Heating Ventilation and Air Conditioning
Shock and Vibration
  • Component Selection
  • Mounting and Isolation
Vandalism
  • Access Controls
  • Surveillance
Animals/Insects
  • Site Prep/Clear Vegetation
  • Cover/Close Access Ports
  • Screen Vents
  • Remove Debris and Refuse
  • Regular Inspections
Power Outage
  • Notification/Coordination of Activities
  • Utility Locates
  • Redundant/Secondary Feed (Long Term)
  • On Site Generator (Days to Weeks)
  • Uninterruptible Power System (Short Term)
Cable Plant Damage
  • Utility Locates
  • Redundant Cable
  • Notification/Coordination of Activities
Improper or Incorrect Maintenance
  • Staffing/Training
  • Management Oversight
  • Diagnostic Tools
  • Logistics (Spares/Provisioning and Deployment)
  • Preventive Maintenance/Inspections
  • Upgrades/Process Improvement
  • Communication/Coordination of Activities
Improper or Incorrect Operation
  • Management Oversight
  • Staffing/Training
  • Communication/Coordination of Activities
  • Upgrades/Process Improvement

Table 9-2 lists the typical critical components, their network locations, failure modes and effects, and how those failures would be detected and isolated for a typical TMS fiber optic communications infrastructure. As shown in the table, most failures would be automatically detected by the network management system and isolated to the component level by either the network management system or maintenance staff and in some cases with the assistance of the network administrator and software staff. An optical time domain reflectometer (OTDR) is the primary tool used by the maintenance staff to locate a problem with the fiber optic media and verify its repair. The mean time to restore (MTTR) includes the time to detect and isolate the failure as well as test the repair needed to restore full functionality. MTTR is estimated based on having spare critical components strategically pre-positioned and a well-trained technical staff skilled in the use of the network management system, OTDR, and other necessary tools.

Table 9-2. Failure Modes and Effects
Critical Component: Fiber Distribution Center
  • Failure or Fault: Fiber Optic Pigtail/Fiber Optic Connector
  • Location: Network Node
  • Effect: Link Loss/Multiple Link Loss
  • Detection: NMS
  • Isolation: Maintenance Staff
  • Mean Time to Restore: < 4 Hrs.

Critical Component: Fiber Optic Jumper
  • Failure or Fault: Fiber Cable/Fiber Optic Connector
  • Location: Network Node
  • Effect: Link Loss
  • Detection: NMS
  • Isolation: Maintenance Staff
  • Mean Time to Restore: < 1 Hr.

Critical Component: Splice
  • Failure or Fault: Single Fiber/Multiple Fibers
  • Location: Field Splice Box
  • Effect: Loss of 2-way Comm/Multiple Link Loss
  • Detection: NMS
  • Isolation: OTDR
  • Mean Time to Restore: 4 Hrs. to 2 Days

Critical Component: Fiber Backbone
  • Failure or Fault: Elongation/Bend Radius/Abrasion/Partial Cut/Sever
  • Location: Turnpike Mainline
  • Effect: Performance Loss/Multiple Link Loss/Dn. Stream Comm Failure
  • Detection: NMS
  • Isolation: OTDR
  • Mean Time to Restore: 1 to 4 Days

Critical Component: Fiber Drop
  • Failure or Fault: Partial Cut/Sever
  • Location: TMC or Equipment Site
  • Effect: Loss of 2-way Comm/Comm Failure
  • Detection: NMS
  • Isolation: OTDR
  • Mean Time to Restore: 1 to 2 Days

Critical Component: Network Repeater
  • Failure or Fault: Input Port/Output Port/Power Supply/CPU
  • Location: Network Hub Site
  • Effect: Node and/or Dn. Stream Comm Failure
  • Detection: NMS
  • Isolation: NMS/Maintenance Staff
  • Mean Time to Restore: < 4 Hrs.

Critical Component: Network Switch/Router
  • Failure or Fault: Input Port/Output Port/Power Supply/CPU/Routing Table/Software
  • Location: TMC
  • Effect: (not specified)
  • Detection: NMS
  • Isolation: NMS/Maintenance Staff/Network Administrator
  • Mean Time to Restore: 2 to 4 Hrs.

Critical Component: Hub
  • Failure or Fault: Input Port/Output Port/Power Supply/CPU
  • Location: TMC or Equipment Site
  • Effect: Link or Multiple Link Loss
  • Detection: NMS
  • Isolation: NMS/Maintenance Staff
  • Mean Time to Restore: < 2 Hrs.

Critical Component: Network Management Host
  • Failure or Fault: Network Interface Card/Power Supply/CPU/Operating System/NMS Software
  • Location: TMC
  • Effect: Loss of Comm. Subsystem Health and Status Data and Reconfig. Capability
  • Detection: Network Administrator
  • Isolation: Network Administrator/Maintenance Staff/Software Staff
  • Mean Time to Restore: < 1 Hr. (Switchover to Hot Standby)

Critical Component: Network Server
  • Failure or Fault: Network Interface Card/Power Supply/CPU/Operating System/Application Software
  • Location: TMC or Equipment Site
  • Effect: Loss of System/Local Functionality
  • Detection: NMS
  • Isolation: Network Administrator/Maintenance Staff/Software Staff
  • Mean Time to Restore: < 4 Hrs.
NMS = Network Management System


An examination of table 9-2 suggests that a high (e.g., 99 percent) availability goal would be difficult to achieve, because the MTTRs for some failure events exceed the roughly 8 hours of outage that a 99 percent goal allows in a 30-day period, unless these events have a low probability of occurrence. One way to mitigate these failures is to provide redundancy for the critical components with potentially high MTTRs. For example, if a backup TMC is implemented that provides a hot standby capability for the primary TMC network management host and network servers, the estimated MTTR can be much less than one hour (perhaps seconds), assuming a switchover to the backup TMC occurs following a failure at the primary TMC. Without this capability, it could take a day or more to restore full functionality at the TMC even if the necessary spares were available. Note that once a switchover to a redundant capability at the backup TMC is accomplished, a subsequent failure of that capability would result in an outage that could take a full day to recover, unless there are multiple levels of redundancy or the primary TMC is repaired before a failure occurs at the backup TMC.

Since the fiber backbone, fiber drops, and associated splices have high estimated MTTRs, it would be prudent to implement some type of redundancy to mitigate the impact of a failure event for these elements as well.

If backup and redundant elements are part of the overall project requirements and specifications, it is important that the testing program, at all levels, verify the switch-over times, the recovery process, and the system's ability to detect and alert the operators to the failure(s). Such testing should include multiple and compound failures of all systems and components. This type of testing should be representative of the failures that will occur; i.e., simply using the computer console to "halt" a process is not the same as shutting the power down and observing the result.

9.10.2 System Reliability, Availability and Redundancy

The intent of a system availability requirement is to set a standard for acceptable performance for the system as a whole to avoid installing a system that does not meet operational needs or, worse, is not reliable (as defined in the requirements). Requiring a system to meet a specific performance standard with respect to reliability and availability at the outset typically comes with a very high initial cost. This is primarily due to over design and over specification coupled with the attendant analysis and testing needed to verify that a specific performance standard has been met. Because reliability and availability are related, setting a goal (rather than a hard requirement) for system availability may allow both to be achieved over time through a process of continuous improvement and can result in a significantly lower overall cost. For this approach to work, however, it is essential that system operational performance and failure data be collected to determine whether the availability goal is being met and thus whether and where improvements are necessary.

Defining an acceptable level of availability for a large, complex system can be a daunting task. There are two key aspects to this task:

  • Identifying those functions (hence components and processes) that are deemed critical to system operation and the stated system mission. It is assumed that the loss or interruption of these functions for longer than some pre-determined time interval is defined to be system failure.
  • Determining the duration of operation without failures (i.e., failure-free operation).

Ideally, one would like to have a very long period of failure free operation, particularly for the critical functions. The reality is that the system components or a software process will fail or that some aspect of the system's performance will eventually degrade below an acceptable level. All that one can reasonably expect is that the failure is quickly detected and diagnosed, and the repair or replacement of the failing item is completed as soon as possible, thus restoring normal operation.

If one cannot tolerate the loss or interruption of a critical function (even for a short interval), some form of active redundancy is required. That is, some alternate means of accomplishing the critical function must be instantly available. Several levels of redundancy might be required to reduce the probability of a loss or interruption to near zero. If a failure can be tolerated for a short period of time, then there is the possibility that the critical function can be restored within that time interval, either by switching to a standby redundant capability or by repairing or replacing the component or process causing the loss of functionality. The longer the outage can be tolerated, the greater the likelihood that the critical function can be restored without relying on redundancy. Hot standby redundancy is always an expensive solution and is usually not necessary or required unless the critical function has a life safety aspect or is considered to have other mission critical real-time dependencies.

In order to set a system availability goal that is both meaningful and reasonable for the TMS, it is necessary to define some terms and discuss some mathematical relationships.

Availability (A) is the probability that an item will operate when needed. Mathematically, it is defined as the ratio of the failure free service interval to the total in-service interval, typically expressed as:

A = MTBF/(MTBF+MTTR)

Where:

Mean Time Between Failures (MTBF) is the average expected time between failures of an item, assuming the item goes through repeated periods of failure and repair. MTBF applies when the item is in its steady-state, random-failure life stage (i.e., after the infant mortality and before the wear-out period), and is equal to the reciprocal of the corresponding constant failure rate, which for this stage is also the Mean Time To Failure (MTTF).

Mean Time To Restore (MTTR) is the average expected time to restore a product after a failure. It represents the period that the item is out of service because of the failure and is measured from the time that the failure occurs until the time the item is restored to full operation. MTTR includes the times for failure detection, fault isolation, the actual repair (or replacement), and any re-start time needed to restore full operation.

Reliability (R) is the probability that an item will perform a required function under stated conditions for a stated period of time. Mathematically, reliability is typically defined as:

R = e^(-T/MTBF)

Where:

e is the base of the natural logarithm (2.718...)

T is the time of failure free operation

MTBF is the mean time between failures, equal to the reciprocal of the constant failure rate.

For example, if an item had a demonstrated MTBF of 2000 hours, what is the probability of achieving 200 hours of failure free operation?

R = e^(-200/2000) = 0.905 or 90.5%

Thus, there is a 90.5 percent probability that 200 failure free hours of operation could be achieved. Continuing with this example: if the item can be repaired or replaced and returned to service in 4 hours, what is the expected availability during the failure free interval?

A = 2000/(2000+4) = .998 or 99.8%

With a 4-hour restoration time, the item can be expected to be available for service 99.8 percent of the time.

The above examples are very simplistic and only apply to a single item. For large, complex systems, reliability is typically assessed for the critical path, i.e., the series of components and processes when taken together provide critical system functionality. It is computed as the product of the reliabilities of the components/processes on that path. In practice, estimating a system's reliability and availability would be a very difficult task and require an enormous effort even if all of the necessary failure and repair statistics were available, including the appropriate usage assumptions, confidence levels, weighting factors, and a complete understanding of all the failure modes and effects for each of the system's components. The operating agency can, however, impose a system-level availability goal, define critical functions, and collect operational performance data with respect to failure free operation time and time to restore operations for those critical functions. This information can be used to compute actual system availability for comparison against the proposed goal. The information collected will be useful in determining whether the current operational performance level is acceptable and what needs improvement.
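The relationships above are simple enough to compute directly. The short Python sketch below reproduces the single-item formulas and the critical-path (series) calculation; the numbers are the worked examples from this section plus the 30-day downtime budget implied by a 99 percent goal, and the three-item critical path is purely illustrative.

    import math

    def availability(mtbf_hours, mttr_hours):
        # A = MTBF / (MTBF + MTTR)
        return mtbf_hours / (mtbf_hours + mttr_hours)

    def reliability(t_hours, mtbf_hours):
        # R = e^(-T/MTBF): probability of T hours of failure free operation
        return math.exp(-t_hours / mtbf_hours)

    def series_reliability(component_reliabilities):
        # Critical-path reliability: the product of the reliabilities on the path
        return math.prod(component_reliabilities)

    print(reliability(200, 2000))    # ~0.905, i.e., 90.5% chance of 200 failure free hours
    print(availability(2000, 4))     # ~0.998, i.e., 99.8% availability with a 4-hour MTTR
    print(series_reliability([0.99, 0.98, 0.995]))  # illustrative critical path of three items
    print(0.01 * 30 * 24)            # ~7.2 hours: outage allowed by a 99% goal over 30 days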

If a service outage of 12 hours during a 24-hour by 5-day operational period were tolerable, the system availability goal would be:

A = (24*5 - 12)/(24*5) = 108/120 = 0.90 or 90%

A 90 percent availability goal may not be high enough initially, but this value does allow for a significant failure restoration time and accounts for single as well as multiple failure events. The 12 hours allotted includes the time to detect the failures, dispatch a maintenance technician and/or software engineer, diagnose and isolate the problem, repair or replace the problem component or process, test, and, if necessary, re-start the operation. If the operating agency finds that the 90 percent availability goal results in an unacceptable operational service level, it can be raised to force improvements to be made.

Note that service outages can be caused by both unplanned events and scheduled maintenance and upgrade events. The effects that determine the duration of the outage include whether or not there is any redundancy and the switchover time, as well as failure containment (i.e., minimizing the propagation of a failure once it occurs). Effects that influence recovery time include failure detection and fault isolation times, repair or replacement, and functional test or workaround times. Hardware and software upgrades and plans to minimize service outages through provisioning of spares, critical replacement components, and diagnostic tools are all part of a contingency recovery plan that can be implemented to accomplish availability improvements.

An availability goal forces the operations agency to identify the system's critical functions and collect actual system performance and failure data with respect to those critical functions to determine whether that goal is being met. Without a goal, poor system performance will still be noticed and be unacceptable, but there won't be any hard information as to what needs improvement and by how much. An availability goal also requires the operations agency to address contingency recovery planning which might otherwise be overlooked.

Reliability goals are often used during the "observation" period to determine pass/fail criteria. In this manner, the level of availability can be measured and the determination of when to suspend, terminate, or continue the observation period can be established and measured.
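As a minimal sketch (with illustrative numbers, not recommended values), the measured availability during an observation period can be computed from the logged outage time and compared against the goal:

    # Measured availability over a 60-day observation period (illustrative values).
    GOAL = 0.90
    period_hours = 60 * 24           # length of the observation period
    outage_hours = 30                # total logged service outage for critical functions
    measured = (period_hours - outage_hours) / period_hours
    print(measured)                  # ~0.979
    print("continue" if measured >= GOAL else "suspend or re-start")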

9.11 Testing Myths

The following are two commonly held beliefs concerning testing that, in reality, are myths.

9.11.1 Elimination of Software Errors

The belief that software errors, or bugs, can be eliminated by extensively testing the final product is a myth. Well-written software requirements can be verified at least to the functional and operational level. However, one of the unique problems in testing software is establishing a test environment and developing appropriate test stimuli that are both sufficiently robust and directly comparable to the real-world operational environment. In addition, because of the nearly infinite number of possible paths through the software code created by the necessary conditional statements and code modules, testing each possible path would take a practically unlimited amount of time and resources. Only after long operational periods under a variety of conditions and performance stress will most software errors be detected.

Once detected, they can be fixed, or operational procedures changed, to avoid problem conditions. When an error condition or anomalous event occurs or is suspected, a specific test can usually be developed to capture triggering conditions or circumstances and allow the investigation and resolution of the problem. The key is identifying the conditions that allow the test team and the software developer to produce a test case that reliably reproduces the problem. This is true only when the problem is repeatable or can be reliably triggered by some stimulus.

Such tests or test requirements are rarely developed beforehand since the anomalous behavior is not contemplated or expected. If it were, the design should have included a means for avoiding or detecting the problem and taking some corrective action. Moreover, if included in the design or operational requirements, an acceptance test (design review, code inspection and walk through, unit test, build test, integration test or system test) would have verified the desired behavior.

When a problem is not repeatable, i.e., it appears to occur randomly under a variety of different conditions or circumstances, it is most often a software problem rather than a hardware problem. Hardware tends to exhibit intermittent failures under similar conditions or stress and circumstances related to the physical environment. Finding and fixing a problem that cannot be readily triggered by a specific set of conditions or stimulus requires a tremendous amount of luck and technical skill. Over time and as other problems are resolved, these seemingly intractable problems sometimes resolve themselves (because they were caused by interactions with other problems), or they become repeatable such that they can be found and fixed, or simply become less bothersome and easier to live with. A software component or operating system upgrade may ultimately fix this class of problems as there are often obscure bugs in the operating system software that only become active under a specific set of circumstances, which may not be repeatable. A word of caution which was also noted earlier: one method often used to track the cause of intermittent software problems includes the use of debugging tools provided by the operating system, COTS products, or compilers. The introduction of these debugging aids can also perturb the inter-process timing relationships so that the problem "disappears" when the debugging aids are present, and re-appears when they are turned off.

A practical solution is to retain competent software development and test personnel throughout the operational life of the system to deal with the inevitable software errors. It is recommended that software maintenance contracts be executed with all of the COTS and software providers. Most software developers will be continuing to test, enhance, and fix bugs as they are discovered. The software maintenance contract provides a mechanism for introducing those changes into existing systems. However, this approach may also have its share of problems and needs to be part of a configuration management program. Upgrades can have both positive and negative results - it is important that upgrades be tested in an isolated environment and that roll-back procedures be in place in case the upgrade is worse than the existing system or is not applicable to a specific platform or operating environment.

9.11.2 Software Version Control

The belief that a software bug found, fixed, and cleared by verification testing will stay fixed is not necessarily true. The problem was resolved and testing verified that it had been fixed; yet it magically appears again. How is this possible? There could be a number of reasons; perhaps the problem was not really fixed or perhaps the problem was simply masked by other "fixes" or features. In some cases, a new release has replaced the previous release and somehow, during the development of the new release, the old problem reappeared. The reason is that the fix was not incorporated in the newer releases - it was effectively lost when the new release was created.

This typically results from a version control problem (or software configuration management lapse). All the elements that were supposed to be in the new release (including the subject fix) were not properly documented and accounted for when the new release was built, and because the fix was documented improperly, the regression test that should have incorporated the test for the fix did not occur. Hence, the new release was installed and passed the regression testing, but that testing failed to test for the old problem. In subsequent installation testing with the new release, the old problem resurfaced. The subject fix will now have to be incorporated into a new release and re-tested.
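One common safeguard is to pin each fix with an automated regression test that is run against every new release build, so that a release that silently drops the fix fails before it is installed. The sketch below is hypothetical: the function, the SPCR number, and the behavior are illustrative, and any unit test framework could be used (the test follows pytest naming conventions).

    def validate_logon(password_expired: bool) -> str:
        # Stand-in for the fixed behavior: expired passwords must be rejected.
        return "rejected" if password_expired else "accepted"

    def test_expired_password_rejected():
        # Added when the (hypothetical) SPCR-123 fix was made; kept in the regression
        # suite so any release that loses the fix fails this test before installation.
        assert validate_logon(password_expired=True) == "rejected"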

Software version control is extremely important, particularly for a large development project that may have multiple release versions in various stages of integration testing at the same time. The agency needs to periodically audit the CM program to ensure that all problems have been noted, and included in subsequent releases.

9.12 Testing Tradeoffs

There are a number of testing tradeoffs that can have a favorable impact on the cost, schedule, or resources required to conduct the overall testing program. Three examples are provided here.

9.12.1 Accepting Similar Components

Whether or not to require the full test suite when similar components are added to expand an already accepted and operational system can be a difficult question. The risk lies in how similar the new components are to the ones already accepted (e.g., are the electrical and communication interfaces the same, will the component fit in the available space, etc.). The safest, least risky course of action is to subject each new component to the same level of testing (if not the same test procedures) used for the currently accepted components. If, however, there are some risk-mitigating circumstances, such as the fact that the product(s) in question are from the same vendor, are listed on a QPL, or are in wide use by others in similar environments, then consideration should be given to abbreviating the testing requirements for these components and accepting vendor test results or a certificate of requirements compliance from the vendor for at least the lower level unit and factory acceptance testing in order to reduce the impact on the testing budget.

9.12.2 Using COTS

Commercial-off-the-shelf (COTS) products, if found to meet the necessary requirements, can save a great deal of money in both development and testing costs that would otherwise be associated with a new or custom product. The tradeoff is the continuing licensing costs to use the product and the cost of product maintenance (including product upgrades to remain eligible for vendor maintenance) over the product's useful lifetime versus the cost to develop, test, and maintain a custom product. COTS products will usually save money in the short run and will allow needed functionality to be achieved more quickly; however, the longer they are used, the more expensive they become. Eventually, COTS may cost more than a custom product over which you would have had complete control to upgrade, modify, and otherwise use as you see fit. In addition, if you choose a COTS product, you will have to tailor your requirements and operations to meet those of the product and accept the fact that some desired features will not be implemented. For some classes of products, such as operating systems, relational database management software, computers, servers, routers, etc., the choice is clear: choose a COTS product. You can't afford to develop these, nor should you need to.

For the TMS application software, the choice will depend on the available budget, schedule, and the specific operational needs of the system. The agency needs to carefully review the proposed solution and be comfortable with the "adaptations" required to use the product in their environment. Be mindful that the benefits of using a COTS product can be lost when significant customization is contemplated. Some companies have spent more to modify an existing system to meet their needs than a whole new system might have cost. With today's modular software, it may be possible to assemble a system from well-known and tested modules that minimize the new development required.

Another consideration is the ongoing software maintenance where your choice is a COTS TMS application vs. a custom developed application. If your implementation is unique, you can expect that your agency must bear the full cost of all software support, including upgrades when required to replace hardware that has become obsolete. If your implementation relies on a vendor's "standard" software, then the maintenance costs are likely being shared amongst all of the clients using this software. When it comes to testing new releases or changes, each approach has its own challenges. The use of COTS application software generally means that the vendor must simply update their previous test procedures to demonstrate the new features and functions; with custom software, it is likely that the agency will need to develop the revised test procedures. Further, with custom software, it is likely that there will be no track record of previous testing with the new release which will require that the agency be more rigorous in its testing program.

9.12.3 Testing Legacy Components

Legacy components, i.e., those left over from a previous system and incorporated in the new TMS (unless operated and maintained as a stand-alone subsystem), will have to be integrated with the new system hardware and software. If all that you have for these components is operations and maintenance manuals (i.e., the original requirements, design documents, and as-built installation drawings are either non-existent or inadequate), you will be faced with having to reverse engineer the components to develop the information you need to successfully incorporate them into your new system. Testing of these components will be impossible unless a requirements baseline can be established and a "black box"38 approach used. In addition, unless spares and replacement parts are available, maintenance may also be challenging. It may make sense to operate a legacy system as a stand-alone subsystem until it can be functionally replaced by components in the new system. The tradeoff here is that some initial cost, schedule, and resources may be saved by using the legacy components as a stand-alone subsystem, but, for the long term, legacy components should be replaced by functionality in the new system.

9.13 Estimating Testing Costs

The test location, test complexity, number and types of tests, and the test resources required (including test support personnel, system components involved, and test equipment) impact testing costs. Testing is expensive, and estimating test costs is a difficult and complex task that won't be attempted here, except as an example for a hardware unit test given below. What is important and is stressed here is that these costs, while a significant portion of the overall TMS acquisition budget, should not dissuade you from conducting a thorough and complete test program that verifies that each of your requirements has been met. You ultimately control testing costs by the number and specificity of your requirements. A small number of requirements with minimum detail will be less costly to verify than a large number of highly detailed requirements. Both sets of requirements may result in similar systems, but the smaller, less complex set takes less time and resources to verify. Be very careful of what you put in the specification requirements — less requirement detail, unless absolutely necessary, does two things: (1) it allows a more flexible approach to design and implementation, and (2) it reduces the cost to verify the requirement. This approach should be balanced with the agency's expectations since there may be various means by which the requirement could be satisfied by a vendor.

Hardware unit testing can be especially expensive and can significantly add to the cost of a small number of devices. Consider the actual cost of the testing: at a minimum, the agency should send at least two representatives to a planned test, typically an inspector and a technical expert. Most testing should also include the project manager, which increases the number to three people, one of whom is typically a consultant (the technical expert). The NEMA testing will typically require a minimum of 4 days to complete, and product inspection can easily add an additional day unless additional product is available for inspection. Given the above, the cost is typically 120 hours plus preparation and report generation (add an additional 32 hours), with 5 days each for per diem expenses as well as airfare and local transportation. The costs can easily range from $12,000 to $15,000.39 In addition to these direct agency costs, the vendor will incur the cost for laboratory facilities, vendor personnel to conduct the test, and the preparation of the test procedure. One needs to consider these costs when specifying a "custom" product, as they are real costs to the agency, and the vendor's costs will be reflected in the cost of the product.
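As a rough check, the figures in this example and in footnote 39 can be combined into a back-of-envelope estimate; the sketch below simply reproduces that arithmetic under the stated assumptions.

    # Back-of-envelope estimate of the hardware unit test cost example
    # (3 people on site for 5 days, using the assumptions in footnote 39).
    people = 3
    labor_hours = 120 + 32            # on-site testing plus preparation and reporting
    labor_cost = labor_hours * 70     # $70/hour average labor cost, consultant included
    per_diem = people * 5 * 150       # $150/day for 5 days each
    airfare = people * 650
    local_transport = 200

    total = labor_cost + per_diem + airfare + local_transport
    print(total)                      # about $15,000, the upper end of the stated range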

9.14 Summary

The above testing considerations address specific issues that the acquiring agency has control of at the outset of the testing program. Do not neglect these issues; most will have to be dealt with at some point in your testing program. It is better to plan for them and deal with them early in the project life cycle than to react to them later under pressure.




37 The following terms are defined in MIL-STD-490A Specification Practices, Section 3.2.3.6 Use of "shall," "will," "should" and "may."

38 Black box testing is based only on requirements and functionality without any knowledge of the internal design or code.

39 This is the estimated cost for all 3 people (inspector, consultant, and project manager) and assumes a typical per diem of $150 per day, $650 airfare, $70/hour average labor costs (includes consultant hours), and an allowance of $200 for local transportation.
