Appendix C. Global Positioning System-Independent Cruise Estimator Model Estimation

Overview

The GPS-independent cruise estimator, or G-ICE, addresses an important question: In the absence of GPS data, can models be developed to estimate the probability of cruising and what data sources would prove useful for this endeavor? The rationale for this tool is that not every jurisdiction will be able to obtain GPS data. Thus, to support those places, can a tool be developed to estimate cruising even in the absence of GPS data, at least of sufficient quantity and quality? By calibrating estimates from cities where there are large samples of GPS traces, the objective is to enable a low-cost way for geographical areas that do not have access to these traces to estimate cruising with minimal data requirements.

The research team prioritized data sources that had already been cleaned and were ready for processing, and carried out the analysis across multiple cities. This approach provides access to the large sample of GPS traces required for calibrating the estimates. Potentially, the model could provide an order-of-magnitude estimate of cruising with minimal data requirements. Given the availability of the explanatory variables, and depending on its predictive accuracy, G-ICE could be used nationwide; it was thought to hold particular promise for cities without access to sufficient high-quality GPS data.

Using both regression and machine learning approaches, the models were structured with cruising as a function of:

Vectors of covariates of the geographic area including the built environment,
A vector of travel- or trip-related covariates including the parking variable, and
A vector of temporal attributes such as time of the day and day of the week.

The research team implemented different structural forms of the regression and machine learning architectures to identify the one that provides the best predictive accuracy. Beyond the root mean square error and the mean absolute error, measures for the percentage of bad predictions were also provided using a 30-percent tolerance threshold value.
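The three accuracy measures named above can be sketched as follows. This is a minimal illustration, not the research team's code: the function name is hypothetical, and the report does not define a "bad prediction" precisely, so the sketch assumes a prediction is bad when its absolute residual exceeds 30 percent of the actual value. Under that assumption, any nonzero prediction for an observation with zero actual cruising counts as bad.

```python
import math


def prediction_error_summary(actual, predicted, tolerance=0.30):
    """Return RMSE, MAE, and the percentage of 'bad' predictions.

    Assumption (not from the report): a prediction is 'bad' when its
    absolute residual exceeds `tolerance` times the absolute actual value.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal in length")

    # Residual = actual minus predicted cruising.
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)

    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    mae = sum(abs(r) for r in residuals) / n
    bad = sum(1 for a, r in zip(actual, residuals) if abs(r) > tolerance * abs(a))

    return {"rmse": rmse, "mae": mae, "pct_bad": 100.0 * bad / n}
```

A call such as prediction_error_summary(actual_m, predicted_m), with hypothetical lists of mean cruising distances in meters, returns all three measures in one dictionary so that the regression and machine learning models can be compared on the same basis.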

Empirical Models

To implement the models, the research team used StreetLight and Quadrant proprietary trip data from Seattle; Washington, DC; Atlanta; and Chicago with the analysis carried out at the census block group level. Explanatory variables include those with predictive power that are publicly available or easily accessible irrespective of the jurisdiction. Both regression methods and machine learning (multilayer perceptron²⁹ and generalized regression neural networks) approaches were employed for the data analyses.

The dependent variable is cruise, measured either as mean cruising distance or time, or as a categorical form of the cruise variable based on a 250-meter threshold. The categorical variable is 0 when the mean cruising distance is 0, 1 if there is cruising but the mean cruising distance is less than 250 meters, and 2 if the mean cruising distance is more than 250 meters.
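The categorical coding can be sketched as a small function. The function and parameter names are hypothetical. The text does not say how a mean cruising distance of exactly 250 meters is coded, so the sketch assumes it falls in category 2.

```python
def cruise_category(mean_cruise_distance_m, threshold_m=250.0):
    """Map a mean cruising distance (meters) to the 0/1/2 cruise category.

    Assumption (not from the report): a distance exactly equal to the
    threshold is placed in category 2.
    """
    if mean_cruise_distance_m < 0:
        raise ValueError("distance cannot be negative")
    if mean_cruise_distance_m == 0:
        return 0  # no cruising
    if mean_cruise_distance_m < threshold_m:
        return 1  # moderate cruising
    return 2      # high cruising
```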

Covariates include the following:

wkdy: dummy variable that captures whether the trip happened during a weekday or weekend (True = Weekday, False = Weekend)
peak: dummy variable based on the trip end – equals 1 if the time falls within the peak period and 0 otherwise
r_den: housing units per acre on unprotected land
j_den: jobs per acre on unprotected land
p_den: parking meters per acre on unprotected land
lnaadt: log of the average annual daily traffic (AADT) of all road segments within the census block group (CBG)
city_dummy: dummy for cities; for example, the Seattle dummy equals 1 for all trip records from Seattle and 0 otherwise
d2a_ephhm: employment and household entropy, with employment and occupied housing both included in the entropy calculation

Interaction terms were also used. For example, a city dummy (e.g., for Seattle) was interacted with the day-of-the-week dummy or with the peak dummy, which indicates whether the trip ended during the peak period. The variables above provide standardized and widely available data that could be used to develop and validate the model. Equally important, because these variables are non-proprietary, accurate predictions from G-ICE would have relieved cities of the need to purchase data. The variables also provide a basis for objective comparisons across jurisdictions, and the comparative analysis component of the model makes it possible to generalize the process to additional cities.
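One way to construct the city dummies and their interactions with the weekday and peak dummies is sketched below. The record layout, the "city" key, and the generated column names are hypothetical and chosen only for illustration; wkdy and peak follow the covariate list above.

```python
def add_city_interactions(record, cities):
    """Return a copy of a trip record with city dummies and interactions.

    `record` is a dict with keys "city", "wkdy" (True/False), and
    "peak" (1/0). The output column names are illustrative assumptions.
    """
    features = dict(record)
    wkdy = int(record["wkdy"])
    peak = int(record["peak"])

    for city in cities:
        dummy = 1 if record["city"] == city else 0
        features[f"city_{city}"] = dummy
        # An interaction is the product of the two dummies.
        features[f"city_{city}_x_wkdy"] = dummy * wkdy
        features[f"city_{city}_x_peak"] = dummy * peak

    return features
```

In a regression with an intercept, one city would normally be left out of the cities list to serve as the reference category; the text does not state which city, if any, played that role.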

Results

Three analyses were carried out: multiple linear regression with cruising estimated using conditional expectations, generalized regression neural networks (GRNN), and a multilayer perceptron. Because the focus is on prediction, what matters in the regression analysis is the model-explained portion of the variation, that is, the explained sum of squares (ESS) relative to the total sum of squares. This holds whether mean cruising is measured by time or by distance. Including the interaction terms as regressors made no difference to the goodness of fit.
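The ratio of the ESS to the total sum of squares can be computed directly from the observed and fitted values, as in the sketch below. The function name is hypothetical. For an ordinary least squares fit that includes an intercept, this ratio equals the familiar R-squared; for other models the two quantities can differ.

```python
def explained_share(actual, fitted):
    """Return ESS / TSS for observed values and model-fitted values."""
    if len(actual) != len(fitted) or len(actual) == 0:
        raise ValueError("actual and fitted must be non-empty and equal in length")

    mean_actual = sum(actual) / len(actual)

    # ESS: variation of the fitted values around the observed mean.
    ess = sum((f - mean_actual) ** 2 for f in fitted)
    # TSS: variation of the observed values around their own mean.
    tss = sum((a - mean_actual) ** 2 for a in actual)

    if tss == 0:
        raise ValueError("actual values have no variation")
    return ess / tss
```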

Only a marginal difference was observed between the mean square error from the GRNN and that of the linear regression. The percentage of bad predictions (using a 30-percent threshold) also supports the results obtained from the earlier methods. The plot of residuals versus predicted values shows very high positive residuals, an indication that the predictions were too low. Ideally, these values should be close to zero, given that the residual for each observation is simply the actual minus the predicted cruising.

To extend the previous analyses, the mean cruising distance was converted to a categorical variable using a 250-meter threshold. A multilayer perceptron (MLP) neural network was then run with cruise as the dependent variable and with two hidden layers of 100 nodes (neurons) each. Using the predict command, where the predicted value defaults to the class with the highest probability, a prediction accuracy of 88 percent was obtained across all the cities in the analysis. This figure, however, overstates the goodness of fit because of the large number of observations with zero cruise. For cruises classified as moderate (cruise=1), only 4 percent of the predictions are correct, while none of the predictions for high cruising (cruise=2) is correct. This is a source of concern because these are the instances in which a false negative cannot be afforded: situations where there is cruising and countermeasures may be needed, but the predicted value says otherwise.
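The gap between overall accuracy and per-class performance can be made explicit by reporting the share of correct predictions within each actual class (the recall) alongside the overall figure. The sketch below is illustrative only; the function name is hypothetical.

```python
from collections import Counter


def accuracy_and_recall(actual, predicted):
    """Return overall accuracy and the recall for each actual class."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal in length")

    totals = Counter(actual)
    # Count correct predictions, keyed by the actual class.
    hits = Counter(a for a, p in zip(actual, predicted) if a == p)

    accuracy = sum(hits.values()) / len(actual)
    recall = {label: hits[label] / totals[label] for label in sorted(totals)}
    return accuracy, recall
```

As a toy illustration (made-up labels, not the study data), a classifier that always predicts 0 on eight zero-cruise observations, one moderate observation, and one high observation scores 80 percent overall while its recall for classes 1 and 2 is zero. This is the same pattern as the inflated accuracy described above.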

Conclusion

As mentioned, the 88 percent accuracy figure obtained for the MLP overstates the goodness of fit, given that none of the instances of cruise=2 (high cruising) was predicted correctly. This is problematic because it is in this situation that the public may be sensitive to Type II errors, or false negatives (a prediction of no cruising when there is indeed cruising). The research team surmises that the determinants of cruising are temporally localized events, which were not adequately reflected in the present analysis. For example, it is difficult to achieve good predictive power when covariates are based on values averaged over time (e.g., AADT) or when the values used have no temporal association with the present. Policy effects are also nuanced. For example, the research team found a statistically significant increase in distance cruised when the parking meters were switched off in Seattle, although this happened between March and April 2020, when the city was on lockdown and total trips were about one-third of their prior level. G-ICE did not successfully project this outcome.

²⁹ A perceptron is a linear classifier that takes input regressors and generates a single binary output via an activation function that is triggered when the cumulative sum of the weighted input regressors exceeds a specific threshold.