Office of Operations
21st Century Operations Using 21st Century Technologies

Data Quality White Paper

2.0 Literature Review of Previous Efforts

This chapter reviews the existing literature on data quality measures, with an emphasis on traffic data. Traffic data quality has been an issue since the earliest days of traffic data collection, and the growing deployments of ITS projects across North America and worldwide require extensive evaluation of data quality issues. While new data collection technologies, data collection methods, and the analytical studies built on them are widely used in ITS projects, relatively few studies have evaluated the quality of the underlying traffic data. The following sections discuss recent research efforts regarding traffic data quality.

2.1 "Guidelines for Data Quality for ATIS Applications" (2000)

Recent research and analysis efforts have identified several issues regarding the quality of traffic data available from ITS applications for transportation operations, planning, and other functions. For example, ITS America's Advanced Traveler Information Systems (ATIS) Committee formed a Steering Committee and held the "ATIS Data Gaps Workshop" in 2000, which identified information accuracy, reliability, and timeliness as critical to ATIS. The key findings of Closing the Data Gap: Guidelines for Quality Advanced Traveler Information System (ATIS) Data are the following [ITS America and U.S. Department of Transportation, Closing the Data Gap: Guidelines for Quality Advanced Traveler Information System (ATIS) Data. 2000: Washington, D.C.]:

  • Guidelines for quality ATIS data are desirable
  • Further refinement is needed in classifying types of data, quality attributes for each type of data, and quality levels for each attribute
  • Guidelines for quality data go beyond ATIS

The Steering Committee identified five reasons for publishing this document:

  • Raise awareness of the need for data collection planning
  • Increase the amount of traffic data being collected
  • Increase the quality of traffic data being collected
  • Increase the recognition of the value of data
  • Encourage similar efforts for traffic management, transit management, and transit-related and rural traveler information data collection

One of the earliest efforts on data quality was the drafting of guidelines for quality ATIS data. The report Closing the Data Gap: Guidelines for Quality Advanced Traveler Information System Data provided useful insight into the processes required for data quality [ITS America and U.S. Department of Transportation, Closing the Data Gap: Guidelines for Quality Advanced Traveler Information System (ATIS) Data. 2000: Washington, D.C.]. To establish these guidelines, two separate issues should be considered: data content and data access. Data content defines the type, coverage, and quality of the data collected, while data access covers the availability of data to organizations for use in creating ATIS products and services. The report states that the most frequently cited data quality complaint is inadequate geographic coverage, stemming mainly from incomplete data collection efforts in metropolitan areas with multiple jurisdictions, particularly with respect to traffic speeds. Other common complaints include inaccurate information, insufficient update frequency, lack of data timeliness, and inadequate spatial resolution. The guidelines defined four types of real-time traffic data: traffic sensor data, incident/event reports, images, and road/environmental sensor station data, and specified attributes and desired data quality levels for each type. Table 1 illustrates the attributes and data quality levels for traffic sensor data.

Table 1. Attributes and Quality Levels of Traffic Sensor Data

  Attribute             Quality Level
  --------------------  ------------------------------------------------------
  Nature                Limited Access Highways – Aggregated Point Data;
                        Principal Arterials – Aggregated Section Data
  Accuracy              < 15% error
  Confidence            Qualitative measure of suspicious data communicated
                        along with the data
  Delay                 < 5 minutes
  Availability          > 95% availability
  Breadth of Coverage   Limited Access Highways – Major Roadways;
                        Principal Arterials – Major Roadways
  Depth of Coverage     Limited Access Highways – Between Major Interchanges;
                        Principal Arterials – Between Major Arterials/Limited
                        Access Highways

The guidelines also defined three quality levels, "good," "better," and "best," for assessing the data attributes. The "good" quality level is the minimum level of data collection that should be designed for each attribute, while the "better" and "best" quality levels provide improved levels of service.

The ATIS data guidelines are a useful resource that offers the opportunity to enhance and improve available ATIS data and applications. Although the guidelines are limited to real-time, dynamic traffic-related information for near-term traveler information services, they provide a foundation for ATIS data collection. The guidelines nevertheless need further refinement in classifying types of data, quality attributes for each type of data, and quality levels for each attribute.

2.2 "Traffic Data Quality Workshop Proceedings and Action Plan" (2003)

The quality of traffic data and the information derived from it is critical, since traffic and travel condition information affects the management of transportation resources and is used by the traveling public in making travel decisions. In 2003, FHWA convened a series of traffic data quality workshops and developed an action plan to help stakeholders address traffic data quality issues. The workshops presented the findings of three white papers in order to stimulate discussion and obtain input from participants on how to address traffic data quality concerns. The three white papers are as follows:

  • Defining and measuring traffic data quality
  • State of the practice for traffic data quality
  • Advances in traffic data collection and management

The following sections summarize the three white papers and discuss an action plan report to improve traffic data quality [Turner, S., Defining and Measuring Traffic Data Quality: White Paper on Recommended Approaches. Transportation Research Record, 2004(1870): p. 8; Fekpe, E. and D. Gopalakrishna, Traffic Data Quality Workshop Proceedings and Action Plan. 2003, Prepared for Federal Highway Administration: Washington, D.C.; Middleton, D., D. Gopalakrishna, and M. Raman, Advances in Traffic Data Collection and Management: White Paper. 2003, Prepared for Federal Highway Administration: Washington, D.C.; Margiotta, R., State of the Practice for Traffic Data Quality: White Paper. 2002, Prepared for Federal Highway Administration: Washington, D.C.].

2.2.1 Defining and Measuring Traffic Data Quality

Data quality is a relative concept that can have different meanings to different consumers. Even if data are good enough for one user, the same data might not be of acceptable quality for another. Thus, it is important to consider and understand all intended uses of data before attempting to measure or prescribe data quality levels. Researchers [Strong, D.M., Y.W. Lee, and R.Y. Wang, 10 Potholes in the Road to Information Quality. Computer, 1997(Aug.): p. 38-46; English, L.P., 7 Deadly Misconceptions About Information Quality. 1999, Brentwood, Tenn.: Information Impact International, Inc.; English, L.P., Improving Data Warehouse and Business Information Quality. 1999, New York: John Wiley and Sons.] have defined data quality as "fit for use by an information consumer," "fitness for all purposes in the enterprise processes that require it," "phenomenon of fitness for 'my' purpose that is the curse of every enterprise-wide data warehouse project and every data conversion project," and "consistently meeting knowledge worker and end-customer expectations." A white paper, Defining and Measuring Traffic Data Quality, was prepared for the traffic data quality workshops [Turner, S., Defining and Measuring Traffic Data Quality: White Paper. 2002, Prepared for Federal Highway Administration: Washington, D.C.]. The white paper defines data quality as "the fitness of data for all purposes that require it. Measuring data quality requires an understanding of all intended purposes for that data."

The white paper also proposed the following data quality characteristics:

  • Accuracy – The measure or degree of agreement between a data value or set of values and a source assumed to be correct.  It is also defined as a qualitative assessment of freedom from error, with a high assessment corresponding to a small error.
  • Completeness (also referred to as availability) – The degree to which data values are present in the attributes (e.g., volume and speed are attributes of traffic) that require them.  Completeness is typically described in terms of percentages or number of data values.
  • Validity – The degree to which data values satisfy acceptance requirements of the validation criteria or fall within the respective domain of acceptable values.  Data validity can be expressed in numerous ways.  One common way is to indicate the percentage of data values that either pass or fail data validity checks.
  • Timeliness – The degree to which data values or a set of values are provided at the time required or specified. Timeliness can be expressed in absolute or relative terms.
  • Coverage – The degree to which data values in a sample accurately represent the whole of that which is to be measured. As with other measures, coverage can be expressed in absolute or relative units.
  • Accessibility (also referred to as usability) – The relative ease with which data can be retrieved and manipulated by data consumers to meet their needs. Accessibility can be expressed in qualitative or quantitative terms.

While several other data quality measures could be appropriate for specific traffic data applications, the six measures presented above are fundamental and should be universally considered when measuring data quality in traffic data applications. The white paper also recommended that goals or target values for these measures be established at the jurisdictional or program level, based on a clear understanding of all intended uses of the traffic data. Data consumers' needs and expectations, as well as available resources, vary significantly by program and by geographic area; these facts preclude recommending a universal goal or standard for these traffic data quality measures. Finally, the paper recommended including metadata when establishing data quality.
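Several of these measures reduce to simple ratios once the intended data set is defined. As a minimal sketch (the field name, speed range, and record counts below are hypothetical, not values from the white paper), completeness and validity might be computed as:

```python
# Illustrative only: computing two of the six measures for a batch of
# detector records. The field name "speed_mph" and the speed range used
# for the validity check are assumptions, not values from the white paper.

def completeness(records, expected_count):
    """Completeness: percent of expected data values actually present."""
    return 100.0 * len(records) / expected_count

def validity(records, min_speed=0, max_speed=100):
    """Validity: percent of present values passing a simple range check."""
    valid = [r for r in records if min_speed <= r["speed_mph"] <= max_speed]
    return 100.0 * len(valid) / len(records)

# Six records arrived out of an expected eight; two fail the range check.
records = [{"speed_mph": s} for s in [55, 62, 58, 130, 47, -3]]
print(round(completeness(records, expected_count=8), 1))  # → 75.0
print(round(validity(records), 1))                        # → 66.7
```

Note that the two ratios use different denominators: completeness is judged against the expected number of records, while validity is judged only against the records that actually arrived.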

2.2.2 State of the Practice for Traffic Data Quality

The white paper State of the Practice for Traffic Data Quality examines which operations and planning applications use traffic data and the quality requirements of those applications, the causes of poor traffic data quality, quality issues specific to ITS-generated traffic data, and possible solutions to quality problems [Margiotta, R., State of the Practice for Traffic Data Quality: White Paper. 2002, Prepared for Federal Highway Administration: Washington, D.C.].

The study reviews traffic data collection procedures by data type and application. Several types of traffic data are collected by both "traditional" and ITS means. While the basic nature and definitions of the data collected are the same, subtle differences in data collection methodologies may lead to problems with data sharing and quality. For example, for planning purposes, traffic volume is typically collected continuously at a limited number of sites statewide; 24-48 hour counts cover most highway segments, and data are usually aggregated to hourly averages for reporting. For many ITS applications, by contrast, traffic volumes are gathered continuously on every segment (1/2-mile spacing is typical on urban freeways); data are often collected at 20-30 second intervals in the field and aggregated for later use at anywhere from 20-30 seconds up to 15 minutes. The paper explores various types of data and applications, comparing current (or traditional) data with ITS-generated data.
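The aggregation step described above is simple to sketch. Assuming fixed-length count intervals (the 20-second interval, 15-minute reporting period, and counts below are illustrative):

```python
# Illustrative only: rolling fixed-interval detector counts up to longer
# reporting periods, the kind of aggregation the white paper describes
# for ITS data. Interval lengths and counts are made up.

def aggregate(counts, interval_s=20, period_s=900):
    """Sum fixed-interval counts into longer reporting periods."""
    per_period = period_s // interval_s  # 45 samples per 15-minute period
    return [sum(counts[i:i + per_period])
            for i in range(0, len(counts), per_period)]

# One hour of 20-second counts (180 samples), 4 vehicles per sample here.
print(aggregate([4] * 180))  # → [180, 180, 180, 180]
```

Real archives also track how many of the underlying samples were present in each period, since a 15-minute total built from only a few valid 20-second samples is itself suspect.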

The paper characterizes traffic data quality by defining "bad" data. Bad or inaccurate traffic data result from various factors, such as the type of equipment, interference from environmental conditions, installation, calibration, inadequate maintenance, communication failures, and equipment breakdowns. To detect bad data, a variety of methods are used, including internal range checks, cross-checks, time series patterns, comparison to theory, and historical patterns. Once bad data are found, imputation, rather than editing the measured values, appears most applicable where small, intermittent gaps appear in the data. Various techniques, including time series smoothing and historical growth rates, have been explored, although there is little consensus in the profession on which techniques to use, or on whether imputation should be done at all.
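A minimal sketch of this two-step process, a range check that flags bad values as missing followed by interpolation across isolated gaps, might look like the following (the threshold and data values are illustrative; real systems layer several such checks):

```python
# Illustrative only: flagging out-of-range values and filling small gaps
# by linear interpolation, one simple form of the imputation the paper
# mentions. The volume threshold and data values are assumptions.

def flag_bad(volumes, max_volume=250):
    """Replace out-of-range 20-second volume counts with None (missing)."""
    return [v if v is not None and 0 <= v <= max_volume else None
            for v in volumes]

def impute_gaps(volumes):
    """Fill isolated single-value gaps with the mean of their neighbors."""
    filled = list(volumes)
    for i in range(1, len(filled) - 1):
        if filled[i] is None and filled[i-1] is not None and filled[i+1] is not None:
            filled[i] = (filled[i-1] + filled[i+1]) / 2
    return filled

series = flag_bad([12, 14, 999, 16, None, 18])  # 999 fails the range check
print(impute_gaps(series))  # → [12, 14, 15.0, 16, 17.0, 18]
```

Consistent with the paper's caution, this only fills single-sample gaps; longer outages are left missing rather than fabricated.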

The study also pointed out the differences between operational and traditional uses of ITS-generated traffic data, including: volumes vs. speeds, data quality control methods, level of accuracy, data collection nuances, data management, level of coverage, vehicle classification definitions, and institutional and data sharing issues. Finally, the study recommended several possible solutions for improving traffic data quality: sampling of ITS locations and data streams; shared resources; maintenance, calibration, and performance standards; contractual arrangements; more sophisticated operations applications as a data quality leader; and new technologies.

2.2.3 Advances in Traffic Data Collection and Management

The white paper Advances in Traffic Data Collection and Management identifies innovative approaches for improving data quality through quality control. The study recommends innovative contracting methods, standards, training for data collection, data sharing between agencies and states, and advanced traffic detection techniques [Middleton, D., D. Gopalakrishna, and M. Raman, Advances in Traffic Data Collection and Management: White Paper. 2003, Prepared for Federal Highway Administration: Washington, D.C.]. Each of these approaches is described below.

The paper first introduces innovative contracting methods that can improve data quality. A few agencies around the country have already invested resources in developing new contracting methods as a means of ensuring data quality. The study presented Virginia and Ohio as case studies showing the potential for data quality improvement through innovative contracting methods, such as performance-based lease criteria for payment of data collection services and a task-order-type contract for maintenance.

The development of standards is introduced as an important aspect of traffic data quality. While standards development is still at an early stage in the United States, many European countries, such as Germany, the Netherlands, and France, have developed national standards for data collection equipment: all equipment purchased for national traffic data collection uses the same formats and communication protocols. Standardization in these countries has increased the quality and accuracy of the data collected, decreased the effort needed to transfer data between agencies or offices, and increased the reliability of field equipment. However, standardization has also increased the initial cost of the equipment compared with non-standard equipment.

Training of personnel is an essential part of ensuring data quality since rapid changes and improvements of hardware and software require constant training.

Data sharing between agencies can result in cost savings and provides an alternative means of meeting data quality needs. For example, the white paper describes how Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, and Vermont have cooperated to share transportation data such as inventory, travel monitoring, and performance data. By working together for many years, these states have improved data quality in a more efficient and cooperative environment.

Finally, advanced traffic detection techniques are discussed as a way to ensure that the data gathered are accurate. The study found that inductive loop detectors continue to serve agencies' needs effectively, and that most failures originate from improper sealing, pavement deterioration, and foreign material in the saw slot rather than from the loop wire itself. Recent research found multi-lane detectors to be most competitive from a cost and accuracy standpoint. Video imaging systems also provide an image of traffic, which is often useful for spot-checking traffic conditions.

2.2.4 Action Plan Development of Traffic Data Quality Workshops

The report defines action plans to address traffic data quality issues. The action plan presents a blueprint to address the traffic data quality based on the findings in the white papers and input received from the regional workshops [Fekpe, E. and D. Gopalakrishna, Traffic Data Quality Workshop Proceedings and Action Plan. 2003, Prepared for Federal Highway Administration: Washington, D.C.]. The following ten priority action items were identified.

  1. Develop guidelines and standards for calculating traffic data quality measures.
  2. Synthesize validation procedures and rules used by various states and other agencies for traffic monitoring devices (or compilation of business rules, data validity checks, and quality control procedures).
  3. Develop best practices for installation and maintenance of traffic monitoring devices.
  4. Establish a clearinghouse for vehicle detector information.
  5. Conduct sensitivity analyses and document the results to illustrate the implications of data quality on user applications.
  6. Develop guidelines for data sharing resources for traffic monitoring activities.
  7. Develop a methodology for calculating life-cycle costs.
  8. Develop guidelines for innovative contracting approaches for traffic data collection.
  9. Conduct a case study or a pilot test.
  10. Provide guidance on technologies and applications.

The ten action items were categorized into three groups of implementation activities: research studies, workshops, and case studies. Action items requiring research studies are (1) development of guidelines and standards for calculating data quality measures, (2) compilation of business rules, data validity checks, and quality control procedures, (3) best practices for equipment installation and maintenance, (5) sensitivity studies to demonstrate the "value of data," and (10) guidance on technologies and applications. Action items requiring workshops are (6) guidelines for sharing resources, (7) life-cycle costs of detection equipment, and (8) improved contracting approaches. Finally, action items (4), the clearinghouse for vehicle detector information, and (9), a case study or pilot test, should be implemented through case studies.

2.3 "Traffic Data Quality Measurement" (2004)

To address the highest-priority recommendation from the Traffic Data Quality Workshops, "Develop guidelines and standards for calculating traffic data quality measures," the report Traffic Data Quality Measurement develops methods and tools that enable traffic data collectors and users to determine the quality of the traffic data they provide, share, and use. Specifically, the report presents a framework with methodologies for developing and evaluating data quality measurement for different applications, along with guidelines for developing and calculating traffic data quality measures.

The framework is based on six data quality measures: accuracy, completeness, validity, timeliness, coverage, and accessibility. It is structured as a sequence of steps for calculating and assessing data quality:

  • Step 1. Know your customer
  • Step 2. Select measures
  • Step 3. Set acceptable data quality targets
  • Step 4. Calculate data quality measures for unique data
  • Step 5. Identify data quality deficiencies
  • Step 6. Assign responsibility and automatic reporting
  • Step 7. Complete the feedback cycle

Case studies demonstrate how the data quality measures are calculated. Table 2 shows the traffic data quality scorecard for the Austin, Texas case study. The results indicate that the quality of traffic detector data in the Austin case reasonably meets the data quality targets, although only 13 percent of freeway sections are covered.

Table 2. Traffic Data Quality for Austin Case Study

  Accuracy – MAPE (Mean Absolute Percent Error) and RMSE (Root Mean Squared Error)
    Original source data:  one-minute speeds – 12.0% MAPE, 11 mph RMSE
    Archive database:      hourly volumes – 4.4% MAPE, 131 vehicles RMSE
    Traveler information:  travel times – 8.6% MAPE, 1.56 minutes RMSE

  Completeness – Percent Complete
    Original source data:  volume 99%, occupancy 99%, speed 98%
    Archive database:      volume 99%, occupancy 99%, speed 99%
    Traveler information:  web site 100%, phone 96%

  Validity – Percent Valid
    Original source data:  volume 99.9%, occupancy 99.9%, speed 99%
    Archive database:      volume 97%, occupancy 98%, speed 99%
    Traveler information:  route travel times 97%

  Timeliness – Percent Timely Data and Average Data Delay
    Original source data:  99.8% timely, 28 seconds delay
    Archive database:      90% timely, 3 hours delay
    Traveler information:  96% timely, delay n.a.

  Coverage – Percent Coverage
    Original source data:  freeways 13%, with 0.4 mile spacing
    Archive database:      freeways 13%, with 0.4 mile spacing
    Traveler information:  freeways 13%, with 0.4 mile spacing; arterials 0%

  Accessibility – Average Access Time
    Original source data:  archive administrator 8 minutes; ISP 10 minutes
    Archive database:      retrieving AADT values, 12 minutes average access time
    Traveler information:  web site 20 seconds; phone 60 seconds average access time
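The two accuracy statistics used in the scorecard, MAPE and RMSE, compare measured values against a reference source assumed to be correct. A brief sketch with made-up numbers (not the Austin data):

```python
import math

# Illustrative only: the two accuracy measures reported in Table 2.
# The "measured" and "reference" values below are invented examples,
# not the Austin case study data.

def mape(measured, reference):
    """Mean Absolute Percent Error, in percent."""
    return 100.0 * sum(abs(m - r) / r
                       for m, r in zip(measured, reference)) / len(measured)

def rmse(measured, reference):
    """Root Mean Squared Error, in the units of the measurement."""
    return math.sqrt(sum((m - r) ** 2
                         for m, r in zip(measured, reference)) / len(measured))

speeds_detector = [52, 61, 48, 55]   # detector-reported speeds (mph)
speeds_reference = [50, 60, 50, 60]  # "ground truth" reference speeds (mph)
print(round(mape(speeds_detector, speeds_reference), 1))  # → 4.5
print(round(rmse(speeds_detector, speeds_reference), 1))  # → 2.9
```

MAPE expresses error relative to the reference value, while RMSE keeps the measurement's units and penalizes large individual errors more heavily, which is why the scorecard reports both.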

The guidelines cover acceptable data quality targets, the level of effort required for traffic data quality assessment, specifications for using metadata, and guidance for data sharing agreements. Data quality targets are defined for different applications using the six data quality measures and reflect acceptable quality based on users' needs and applications. Table 3 shows estimated data quality targets for two sample applications.

Table 3. Sample Data Quality Requirements

  Accuracy
    Traveler information (travel time):     10-15% RMSE (Root Mean Squared Error)
    Air quality conformity analysis (VMT):  10%

  Completeness
    Traveler information (travel time):     95-100% valid data
    Air quality conformity analysis (VMT):  at a given location, 50% – two weeks per month, 24 hours

  Validity
    Traveler information (travel time):     less than 10% failure rate
    Air quality conformity analysis (VMT):  up to 15% failure rate for 48-hour counts;
                                            up to 10% failure rate for permanent stations

  Timeliness
    Traveler information (travel time):     data required close to real time
    Air quality conformity analysis (VMT):  within 3 years of model validation year

  Typical coverage
    Traveler information (travel time):     100% area coverage
    Air quality conformity analysis (VMT):  75% freeways, 25% arterials, 10% collectors

  Accessibility
    Traveler information (travel time):     5-10 minutes
    Air quality conformity analysis (VMT):  5-10 minutes

The guidelines also address data sharing agreements, which define the roles, expectations, and responsibilities of data providers and users. Such agreements typically do not include data quality specifications. The report recommends the following three steps to add data quality provisions to data sharing agreements:

  • Reporting/documenting the quality of the data
  • Specifying what the quality of the data must be
  • Structuring payment schedules based on amount of data passing minimum criteria

2.4 "Quality Control Procedures for Traffic Data Collection" (2006)

Quality control procedures, which monitor and identify data quality problems, are a critical factor in improving traffic data quality, as noted in the action plan from the traffic data quality workshops. The report Quality Control Procedures for Archived Operations Traffic Data: Synthesis of Practice and Recommendations, prepared for the Federal Highway Administration, summarizes current data quality control procedures and recommends a set of quality control procedures for system-specific data quality issues [Turner, S., Quality Control Procedures for Archived Operations Traffic Data: Synthesis of Practice and Recommendations. 2007, Prepared for Federal Highway Administration: Washington, D.C.].

Three general categories of checks are typically used to assess the validity of traffic data. The first, univariate and multivariate range checks, sets minimum, maximum, or expected ranges of values for one or more variables. The second, spatial and temporal consistency checks, evaluates the consistency of traffic data against nearby locations (across lanes, or at upstream and downstream monitoring locations) or against previous time periods. The third, detailed diagnostics, requires diagnostic data from traffic detectors that are not typically available in archived traffic data; these diagnostics can be used to identify the cause(s) of poor data quality at specific detector locations.
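The first two categories can be sketched in a few lines. The record fields and thresholds below are hypothetical; actual archives tune such limits per location and time of day:

```python
# Illustrative only: a univariate range check and a simple temporal
# consistency check comparing each record with the previous time period.
# Field names ("volume_vph", "occ_pct") and thresholds are assumptions.

def range_check(record, max_volume=3000, max_occ=100):
    """Univariate check: each variable within its expected range."""
    return (0 <= record["volume_vph"] <= max_volume
            and 0 <= record["occ_pct"] <= max_occ)

def temporal_check(current, previous, max_jump=0.5):
    """Flag records whose volume changes more than 50% from the prior period."""
    if previous["volume_vph"] == 0:
        return True  # cannot form a ratio; pass by default
    change = (abs(current["volume_vph"] - previous["volume_vph"])
              / previous["volume_vph"])
    return change <= max_jump

prev = {"volume_vph": 1200, "occ_pct": 8}
curr = {"volume_vph": 2500, "occ_pct": 12}
print(range_check(curr))           # → True: each value is plausible alone
print(temporal_check(curr, prev))  # → False: volume more than doubled
```

The example illustrates why the categories complement each other: a value can pass a range check in isolation yet still be suspect relative to its recent history.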

The study also reviewed the validity checks of nine data archives: ADMS Virginia, California PeMS, CATT Lab, Central Florida Data Warehouse, FHWA Mobility Monitoring Program, Kentucky ADMS, Phoenix RADS, PORTAL, and WisTransPortal V-SPOC, and found the following:

  • The validity criteria are similar among the nine data archives
  • The validity criteria are less sophisticated and complex than those described in the literature
  • Nearly all of the validity criteria are programmed on a simple pass/fail basis
  • Most of the validity criteria do not have a specified order or sequence
  • It appears that all validity criteria are applied even if previous criteria indicate invalid data

Finally, the report also provides the following recommendations for quality control procedures:

  • Recognize that validity criteria (i.e., quality control) are only one part of a comprehensive quality assurance process
  • Provide metadata to document quality control procedures and results
  • Provide metadata to document historical traffic sensor status and configuration
  • Use database flags or codes to indicate failed validity criteria
  • At a minimum, implement a basic foundation of data validity criteria
  • Further develop other spatial and temporal consistency criteria for archived data management systems (ADMS)
  • Use visual review to supplement the automated validity criteria

2.5 Summary of Previous Efforts

The above review of the literature shows the significance of traffic data quality and the various factors that can improve it. One of the foremost recommendations from researchers and workshops is the urgent need to develop guidelines for traffic data quality. While previous research efforts have attempted to build such procedures, methodologies, and guidelines, the proposed approaches are too general to satisfy the requirements of a real-time information program. The following chapters explore data quality measures and associated applications for a real-time information program.