BACKGROUND: Diagnostic and prognostic prediction models often perform poorly when externally validated. The reasons for variation in performance across data samples are not fully understood.
OBJECTIVES: We investigate how differences in the measurement of predictors across settings affect the discriminative power and transportability of a prediction model.
METHODS: Differences in predictor measurement between data sets can be described formally using a “measurement error” taxonomy. Using this taxonomy, we derive an expression relating variation in the measurement of a continuous predictor to the area under the receiver operating characteristic curve (AUC) of a logistic regression prediction model. We then use this expression to demonstrate how variation in measurements across samples affects the out-of-sample discriminative ability of a prediction model. We illustrate these findings with a diagnostic model using example data from patients suspected of having deep vein thrombosis.
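For intuition, consider a worked instance of such an expression under a binormal sketch (an illustrative assumption, not the exact derivation reported here): the predictor is normally distributed with mean \(\mu_1\) in cases and \(\mu_0\) in non-cases, with common variance \(\sigma^2\); classical measurement error adds independent noise with variance \(\sigma_e^2\); structural error adds a constant \(c\) to every measurement.

```latex
% Binormal sketch (illustrative assumption): AUC of a single continuous
% predictor observed with classical error variance \sigma_e^2.
\[
\mathrm{AUC} \;=\; \Phi\!\left(\frac{\mu_1 - \mu_0}{\sqrt{2\left(\sigma^2 + \sigma_e^2\right)}}\right)
\]
% Increasing \sigma_e^2 shrinks the argument of \Phi and so lowers the AUC;
% adding a constant c to every measurement leaves \mu_1 - \mu_0 unchanged,
% so the AUC is unaffected.
```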
RESULTS: When a predictor, such as D-dimer, is measured with more noise in one setting than in another, which we conceptualize as a difference in “classical measurement error”, the AUC decreases (Fig. 1a). In contrast, a constant, “structural”, error does not affect the AUC of a logistic regression model, provided the magnitude of the error is the same among cases and non-cases (Fig. 1b). As the differences in measurement methods (and, in turn, differences in measurement error) become more complex, it becomes increasingly difficult to predict how the AUC will be affected.
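This pattern can be reproduced in a small simulation. The sketch below is illustrative, not the authors' analysis: all distribution parameters, error magnitudes, and sample sizes are assumptions chosen for demonstration. It fits a single-predictor logistic regression on a derivation sample and evaluates its AUC in validation samples whose predictor carries either extra random noise (classical error) or a constant shift (structural error).

```python
# Illustrative simulation (assumed parameters, not the paper's analysis):
# compare validation AUC under classical vs. structural measurement error.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def sample(n, noise_sd=0.0, shift=0.0):
    """Draw outcomes and a continuous predictor from a binormal model;
    noise_sd adds classical error, shift adds a constant structural error."""
    y = rng.binomial(1, 0.5, size=n)                        # cases vs. non-cases
    x_true = rng.normal(loc=np.where(y == 1, 1.0, 0.0), scale=1.0)
    x_obs = x_true + rng.normal(0.0, noise_sd, size=n) + shift
    return x_obs.reshape(-1, 1), y

# Derivation sample: predictor measured without extra error.
X_dev, y_dev = sample(20_000)
model = LogisticRegression().fit(X_dev, y_dev)

for label, kwargs in [("no extra error", {}),
                      ("classical error (sd = 1)", {"noise_sd": 1.0}),
                      ("structural shift (+2)", {"shift": 2.0})]:
    X_val, y_val = sample(20_000, **kwargs)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"{label:26s} AUC = {auc:.3f}")
```

Because the AUC is rank-based, the constant shift leaves it unchanged, whereas the added noise degrades the separation between cases and non-cases and lowers the AUC.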
CONCLUSION: When a prediction model is applied to a new sample, its discriminative ability can change if the magnitude or structure of the measurement error is not exchangeable between the two settings. These findings provide an important starting point for researchers seeking to understand how differences in measurement methods can affect the performance of a prediction model when it is externally validated or implemented in practice.