Background: Therapy-related heart failure (HF) is a leading cause of morbidity and mortality in patients who undergo successful hematopoietic cell transplantation (HCT) for hematological malignancies. Timely and accurate diagnosis of HF is crucial to improving its management, yet no established method for diagnosing HF currently exists. One approach to improving diagnosis and management of HF is predictive modeling: assessing the likelihood of HF from key predictors known to be associated with it. Such models can, in turn, inform bedside management, including implementation of early screening approaches. Many techniques for predictive modeling exist, however, and it is not known whether machine learning (ML) approaches are superior to standard statistical techniques for developing predictive models for clinical practice. Here we present a comparative analysis of traditional multivariable models and ML predictive models, in an attempt to identify the best predictive model for diagnosis of HF after HCT.
Methods: At City of Hope, we have established a large prospective cohort (>12,000 patients) of HCT survivors (HCT survivorship registry). This registry is dynamic and interfaces with other registries and databases (e.g., electronically indexed inpatient and outpatient medical records, the National Death Index [NDI]). We used natural language processing (NLP) to extract 13 key demographic and clinical variables known to be associated with HF. For this project, we extracted data from 1,834 patients (~15% sample) who underwent HCT between 1994 and 2004, allowing adequate follow-up for the development of HF. We fit and compared 6 models [standard logistic regression (glm), an FFT (fast-and-frugal tree) decision model, and four ML models: CART (classification and regression trees), SVM (support vector machine), NN (neural network), and RF (random forest)]. Data were randomly split (50:50) into training and validation samples; the ultimate assessment of the best algorithm was based on its performance (calibration and discrimination) in the validation sample. The DeLong test was used to test for statistical differences in discrimination [i.e., area under the curve (AUC)] among the models.
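For illustration, the following is a minimal Python sketch of the workflow described above (50:50 random split, model fitting, and validation-set AUC), using scikit-learn stand-ins for five of the six models; the FFT decision model is omitted because it has no direct scikit-learn equivalent. The file name and column names are hypothetical assumptions, not the registry's actual schema, and the hyperparameters are illustrative defaults rather than those used in the study.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Assumed extract: one row per patient, 13 NLP-derived predictors plus an HF indicator.
df = pd.read_csv("hct_hf_extract.csv")  # hypothetical file name
X, y = df.drop(columns=["hf"]), df["hf"]

# 50:50 random split into training and validation samples.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)

models = {
    "glm": LogisticRegression(max_iter=1000),
    "CART": DecisionTreeClassifier(max_depth=4),
    "SVM": SVC(probability=True),
    "NN": MLPClassifier(max_iter=2000),
    "RF": RandomForestClassifier(n_estimators=500),
}

# Fit on the training sample; assess discrimination (AUC) in the validation sample.
for name, model in models.items():
    model.fit(X_train, y_train)
    p_val = model.predict_proba(X_val)[:, 1]
    print(f"{name}: validation AUC = {roc_auc_score(y_val, p_val):.3f}")
```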
Results: The accuracy of NLP extraction was consistently >95%. Only the standard logistic regression (glm) model was well calibrated (Hosmer-Lemeshow goodness-of-fit test: p=0.104); all other models were miscalibrated. The standard glm model also had the best discrimination (AUC=0.704 in the training set and 0.619 in the validation set). CART performed the worst (AUC=0.5). The other ML models (RF, NN, and SVM) also showed modest discrimination (AUCs of 0.547, 0.573, and 0.619, respectively). The DeLong test indicated that all models outperformed the CART model (at nominal p<0.05) but were statistically indistinguishable from one another (see Figure). Statistical power was borderline sufficient for the glm model and very limited for the ML models.
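As a companion to the calibration result above, this is a minimal sketch of a decile-based Hosmer-Lemeshow goodness-of-fit test, assuming `y_val` and predicted probabilities (here called `glm_probs`) from a fitted model as in the previous sketch; the function name and grouping scheme are illustrative, not the study's exact implementation. (Pairwise AUC comparisons by the DeLong method are commonly performed with R's pROC::roc.test; scikit-learn has no built-in equivalent.)

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, n_groups=10):
    """Decile-based Hosmer-Lemeshow chi-square test of calibration."""
    order = np.argsort(y_prob)
    y_true = np.asarray(y_true)[order]
    y_prob = np.asarray(y_prob)[order]
    stat = 0.0
    for idx in np.array_split(np.arange(len(y_prob)), n_groups):
        n_g = len(idx)
        obs = y_true[idx].sum()    # observed HF events in the risk group
        exp = y_prob[idx].sum()    # expected events (sum of predicted risks)
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n_g))
    p_value = chi2.sf(stat, df=n_groups - 2)  # a small p suggests miscalibration
    return stat, p_value

# Example (hypothetical variable names):
# hl_stat, hl_p = hosmer_lemeshow(y_val, glm_probs)
```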
Conclusions: None of the tested models showed performance characteristics adequate for use in clinical practice. The ML models performed even worse than the standard logistic regression model; given the increasing use of ML models in medicine, we caution against their use without adequate comparative testing.
No relevant conflicts of interest to declare.
Author notes
Asterisk with author names denotes non-ASH members.