• Machine learning models can predict leukemic evolution in patients with acquired AA using retrospective clinical data.

Abstract

Patients with acquired aplastic anemia (AA) treated with immunosuppressive therapy (IST) face up to a 20% long-term risk of developing secondary myeloid neoplasms (sMNs), including acute myeloid leukemia and myelodysplastic syndromes. Although hematopoietic stem cell transplantation (HSCT) is curative and prevents sMNs, older patients and those lacking suitable donors have historically received IST as first-line therapy. Recent improvements in HSCT outcomes have expanded transplant eligibility, highlighting the need for tools to better identify patients at high risk for sMN. Validated predictive models could help guide early HSCT consideration or tailor surveillance strategies. We developed 2 binary machine learning models to predict sMN development in patients with acquired AA at clinically relevant time points: diagnosis (model 1) and 6 months after IST response (model 2). We analyzed data from 275 adult patients with AA treated at University of Texas Southwestern, Cleveland Clinic, and the Hospital of the University of Pennsylvania between 1975 and 2023. Seventy-nine clinical variables were collected, including demographics, somatic mutations, and treatment response. Neural networks were trained with leave-1-out crossvalidation. Both models achieved strong performance (area under the curve, 0.82; sensitivity, 0.82; specificity, 0.73). Shared key predictors included DNMT3A mutation, CUX1 mutation, total mutation count, and age. TET2 mutation was specific to model 1; paroxysmal nocturnal hemoglobinuria clone presence was unique to model 2. High-risk classification was significantly associated with worse overall survival (P < .0001). These findings support the feasibility of machine learning–based sMN risk prediction in AA. With training on larger data sets and external validation, these models may support individualized decision-making around HSCT and post-IST surveillance.

Acquired aplastic anemia (AA) is a rare and life-threatening disorder characterized by immune-mediated destruction of hematopoietic stem and progenitor cells.1 A major long-term complication of AA is the development of secondary myeloid neoplasms (sMNs), such as acute myeloid leukemia (AML) and myelodysplastic syndrome (MDS), which account for a substantial proportion of treatment-related mortality. Among patients receiving immunosuppressive therapy (IST), ∼15% to 20% will eventually experience malignant transformation.2-5 

The standard frontline therapies for severe aplastic anemia (SAA) include IST and hematopoietic stem cell transplantation (HSCT). IST, typically combining horse antithymocyte globulin and cyclosporine A,6 with eltrombopag frequently added to enhance hematologic response, remains the initial treatment of choice for most patients without a matched sibling donor, particularly those aged ≥40 years or with significant comorbidities. Although IST can lead to hematologic recovery in most patients, a subset experience relapse, require prolonged immunosuppression, or develop clonal evolution. In contrast, HSCT offers the potential for long-term hematopoietic reconstitution and is generally considered curative of sMN risk in appropriately selected patients. Historically reserved for younger patients with matched sibling donors, HSCT is increasingly being considered in a broader range of patients due to advances in donor matching, supportive care, and conditioning regimens.7 Current clinical guidelines recommend IST as the preferred initial treatment for older patients and those with significant comorbidities, although HSCT is increasingly considered a feasible option in high-risk individuals and in those with high-risk molecular features.8-10 

Clonal evolution in acquired AA is driven by autoimmune pressure from cytotoxic T cells, which selectively favors hematopoietic stem and progenitor cells harboring somatic mutations or cytogenetic abnormalities.11 Specific gene mutations (eg, ASXL1, RUNX1, DNMT3A, TET2, and BCOR) and chromosomal abnormalities (eg, del(Y), +8, 6p CN-LOH) are frequently observed in bone marrow studies of patients with AA and have highly variable prognostic implications for progression-free survival (PFS).12,13 Previous studies have also identified demographic and treatment-related factors, along with clonal genetic and cytogenetic alterations, that are associated with either increased risk of sMN or protective effects against malignant transformation, informing clinical decision-making.14-17 However, somatic mutations may also be present at diagnosis or emerge over time without signifying imminent transformation, and physicians are advised to interpret them with caution.18 No validated predictive models currently exist to accurately estimate an individual patient’s risk of malignant progression.

In this study, we present 2 machine learning models trained on the clinical data of adult patients with acquired AA. Model 1 is designed to assess a patient’s sMN risk using clinical data routinely obtained during the diagnostic workup, whereas model 2 is designed to reassess sMN risk after 6 months of first-line IST. Such an approach may support future efforts to individualize treatment selection, including consideration of upfront HSCT in carefully selected cases.

Study design and patient selection

We collected a comprehensive multi-institutional data set of patients with AA for use in training machine learning models to predict the incidence of sMNs in patients with AA. The initial cohort included 350 adult patients treated for AA at the University of Texas Southwestern Medical Center, the Cleveland Clinic Foundation, and the Hospital of the University of Pennsylvania between 1975 and 2023.

Patients were excluded if they had sMNs detected at the time of AA diagnosis (n = 18). Additional exclusions were applied for those lost to follow-up, patients with <180 days of transplant-free survival (n = 19), and those who did not complete a standard course of first-line IST (n = 38). This minimized bias from incomplete data, nonstandardized treatment, short-term survival, and early transplantation. After applying these criteria, the final cohort for analysis consisted of 275 patients.

Data collection

Seventy-nine variables relevant to the diagnosis, treatment, and prognosis of AA were collected (Table 1). These variables included demographic information, clinical presentation, laboratory findings, and treatment responses. AA severity was defined using the Camitta criteria. Clinical diagnoses, including AA, AA/paroxysmal nocturnal hemoglobinuria (PNH) overlap syndrome, AML, and MDS, were confirmed through detailed pathologic analysis of bone marrow biopsies, aspirates, and peripheral blood samples.

Diagnostic modalities included cytomorphology, histomorphology, iron staining, chromosome banding, single-nucleotide polymorphism array karyotyping, fluorescence in situ hybridization, and next-generation sequencing (NGS) for myeloid mutation panels. To rule out potential inherited etiologies of AA, patients underwent comprehensive evaluations, including physical examination findings, family history reviews, chromosome breakage studies, lymphocyte telomere length measurements, and NGS panels targeting genes associated with inherited bone marrow failure syndromes. Diagnoses of AA and MDS were retrospectively reviewed and confirmed using the 2016 World Health Organization classification system for myeloid malignancies, ensuring accurate distinction between true sMN cases and nonmalignant clones. As a notable exception to this, we qualified patients harboring clones with chromosome 13q deletions as having AA, not MDS-unclassifiable, in consideration of several studies qualifying this as a favorable prognostic marker and benign karyotype abnormality in acquired aplastic anemia.22-26 PNH clone size was determined by the percent of glycosylphosphatidylinositol-deficient granulocyte cells. A detailed description of the study design, research protocols, patient selection criteria, and variable definitions is provided in the supplemental Materials.

Genomic and cytogenetic analysis

Bone marrow aspirate samples were analyzed using multiple NGS and cytogenetic platforms, each with varying sequencing depths, gene coverage, and limits of detection. All clinical and research laboratories are clinical laboratory improvement amendments certified. Reported variants from myeloid NGS studies were independently reviewed using the VarSome Somatic Variant Classifier27 (Varsome.com) to verify classification accuracy and incorporate any variation classification updates. Detailed descriptions of the genomic analyses, as well as a table containing the individual mutations (supplemental Table 1), are available in the supplemental Methods.

Data preprocessing

Several data preprocessing techniques were implemented, including mutation grouping, data binning, and imputation methods. Somatic mutation data obtained from several clinical and experimental hematopathology NGS panels were standardized to maximize data utility and create data sets suitable for model training. Only variants classified as pathogenic or likely pathogenic in a curated list of 41 recurrent driver genes shared across all platforms (supplemental Table 1) were included in the training data set and downstream analyses. Variants that were confirmed as pathogenic or likely pathogenic and exceeded both the test-specific limit of detection and a uniform variant allele frequency (VAF) threshold of 1% (VAF > 0.01) were assigned a value of 1. All other variants (those classified as benign, likely benign, or of uncertain significance, and those with a VAF below the detection threshold or below 1%) were assigned a value of 0. BCOR and BCORL1 were merged into a single variable (BCOR/L1) to maximize data points and enhance statistical power by reducing missingness. Variables with >30% missing data were excluded from the analysis to ensure data integrity.28-30 Details on data preprocessing are available in the supplemental Materials.
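
The encoding rules described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function names, the record layout, and the rule that BCOR/L1 is flagged when either tested gene is mutated are assumptions introduced here.

```python
# Illustrative sketch; names and data layout are hypothetical.
PATHOGENIC = {"pathogenic", "likely pathogenic"}
VAF_THRESHOLD = 0.01  # uniform 1% cutoff stated in the text


def encode_variant(classification, vaf, assay_lod):
    """Return 1 for a reportable driver variant, otherwise 0."""
    if classification.lower() not in PATHOGENIC:
        return 0
    # The variant must exceed both the assay-specific limit of detection
    # and the uniform 1% VAF threshold.
    return int(vaf > assay_lod and vaf > VAF_THRESHOLD)


def merge_bcor(bcor, bcorl1):
    """Combine BCOR and BCORL1 flags; None means the gene was not tested.

    Assumption: the merged flag is 1 if either tested gene is mutated.
    """
    observed = [v for v in (bcor, bcorl1) if v is not None]
    return max(observed) if observed else None


def drop_sparse_columns(rows, max_missing=0.30):
    """Keep only columns whose fraction of missing (None) values is <= max_missing."""
    keep = [
        col for col in rows[0]
        if sum(r[col] is None for r in rows) / len(rows) <= max_missing
    ]
    return [{col: r[col] for col in keep} for r in rows]


print(encode_variant("Pathogenic", vaf=0.05, assay_lod=0.02))
print(merge_bcor(None, 1))
```

Merging the two genes before the missingness filter means a patient tested for only one of them still contributes a value to the combined variable.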

Model training and validation

For variables with <30% missing data, iterative imputation was used to maximize available data and improve model fit. Missing values, comprising 24% of the data set, were imputed with the multiple imputation by chained equations31 algorithm. Missing values of all predictor variables were included in the imputation process. All features were then scaled to a range of −1 to 1 to support efficient neural network training. A multilayer perceptron (MLP) with 5 hidden layers (32, 16, 8, 4, and 2 neurons) was used for binary classification.32 To address class imbalance (43 sMN vs 232 non-sMN), the positive class was assigned a weight of 7.2, selected to optimize macro area under the curve (AUC) while maintaining sensitivity and specificity of >0.7. Random forest feature importance scores were iteratively computed (N = 10) to rank predictors.33 The MLP was trained via backpropagation and selected for its capacity to model complex interactions among tabular input features.
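
Two of the steps above, scaling to the range −1 to 1 and up-weighting the positive class, can be illustrated as follows. This is a minimal sketch under stated assumptions: min-max scaling is assumed as the scaling method, and the weighted cross-entropy shown is a generic formulation rather than the exact loss used by the training framework. Both function names are hypothetical.

```python
import math


def scale_to_unit_range(values):
    """Min-max scale a list of numbers to the interval [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: map every value to 0
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def weighted_log_loss(y_true, p_pos, pos_weight=7.2):
    """Mean binary cross-entropy with errors on the positive class up-weighted."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(y_true)


print(scale_to_unit_range([18, 54, 90]))
print(weighted_log_loss([1, 0], [0.5, 0.5]))
```

With a positive-class weight of 7.2, a missed sMN case costs the model 7.2 times as much as an equally confident error on a non-sMN case.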

The MLP architecture comprised 5 hidden layers, each using the Leaky ReLU activation function (α = 0.1). Leaky ReLU helps mitigate the vanishing gradient problem by allowing a small gradient even when the unit’s output is <0. The output layer had 2 units and used the SoftMax activation function, which produced output probabilities for both classes (eg, 0.71, 0.29). Each model was trained for 150 epochs, with early stopping triggered after 20 epochs (s = 20) without improvement.
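
The activation functions and the stopping rule described above can be written out in plain Python. This sketch is for illustration only; the quantity monitored for early stopping is not stated in the text, so a per-epoch validation loss is assumed, and all function names are hypothetical.

```python
import math


def leaky_relu(x, alpha=0.1):
    """Pass positive inputs through; scale negative inputs by alpha."""
    return x if x > 0 else alpha * x


def softmax(logits):
    """Convert raw output scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_stopping_epoch(val_losses, patience=20):
    """Return the 1-based epoch at which training would stop.

    Assumption: training halts once the monitored loss has failed to
    improve for `patience` consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses)


print(leaky_relu(-3.0))
print(softmax([0.9, 0.0]))
```

Because the 2 SoftMax outputs sum to 1, the second output carries no extra information beyond the first; either can be thresholded to assign a risk class.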

Leave-1-out (LOO) crossvalidation was used to validate the model. In LOO, 1 sample at a time is designated as the validation set, while the remaining 274 samples are used for training.34 This process repeats until each of the 275 patients has served as the validation set exactly once. Notably, running LOO crossvalidation with 5 to 21 features for the 275 patients required ∼75 seconds on a standard desktop computer (3.5 GHz Intel i7, 12 gigabytes random-access memory, Scikit-learn version 1.3). The final classification assigns a patient to the high-risk category if the prediction score exceeds 0.5 and to the low-risk category otherwise.
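
The LOO procedure and the 0.5 decision threshold can be sketched as below. The `fit` and `predict` callables are hypothetical placeholders for any training and scoring routine; the toy model in the usage lines simply predicts the mean training label and is not the MLP described above.

```python
def leave_one_out_splits(n_samples):
    """Yield (train_indices, held_out_index) pairs, one per sample."""
    for i in range(n_samples):
        train = [j for j in range(n_samples) if j != i]
        yield train, i


def classify(score, threshold=0.5):
    """Assign high risk only if the score strictly exceeds the threshold."""
    return "high" if score > threshold else "low"


def loo_predictions(features, labels, fit, predict):
    """Collect one out-of-sample score per patient."""
    scores = []
    for train, held_out in leave_one_out_splits(len(features)):
        model = fit([features[j] for j in train], [labels[j] for j in train])
        scores.append(predict(model, features[held_out]))
    return scores


# Toy usage with a placeholder model that predicts the mean training label.
toy_scores = loo_predictions(
    features=[[0], [1], [2]],
    labels=[0, 1, 1],
    fit=lambda X, y: sum(y) / len(y),
    predict=lambda model, x: model,
)
print([classify(s) for s in toy_scores])
```

Every score is produced by a model that never saw the held-out patient, which is why the pooled scores can be used to build a single confusion matrix and ROC curve.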

Statistical analysis and software tools

Statistical analyses and data visualizations were conducted using Python (version 3.8.5). Cumulative incidence functions were plotted and competing risk analyses were conducted using SAS Viya (version 3.8.1). Deep learning tasks were performed with TensorFlow (version 2.4.1), using Keras (version 2.4.3) for model construction and training. Scikit-learn (version 0.24.1) was used for key machine learning tasks, including data imputation, feature selection, and model evaluation. The multiple imputation by chained equations algorithm was implemented using Scikit-learn’s IterativeImputer class.35 Data manipulation and numerical computations were performed using Pandas (version 1.2.1) and NumPy (version 1.19.5). To prevent overfitting during model training, the early stopping method was implemented. Data visualization was carried out using Matplotlib (version 3.3.3). Kaplan-Meier curves were generated with the lifelines package (version 0.30.0), and statistical significance for survival analyses was assessed using scikit-survival (version 0.23.1). To evaluate multicollinearity among covariates, the variance inflation factor was calculated with statsmodels (version 0.12.2). Time-dependent AUC values were computed using sksurv (version 0.16) to assess model performance over time.

All patients provided written informed consent, and the study protocol was approved by the institutional review boards of each participating institution. All research activities were conducted in accordance with the ethical principles outlined in the Declaration of Helsinki.

Of 275 patients included in the study, 222 (80.7%) had SAA or very SAA (VSAA), and 53 (19.3%) had nonsevere AA. sMNs developed in 40 (18.0%) patients with SAA/VSAA and in 4 (7.6%) patients with nonsevere AA, with a higher risk observed in the SAA/VSAA group (risk ratio, 2.39; odds ratio, 2.69; P = .064). The cohort comprised 139 females (50.5%) and had a median age of 54.0 years (range, 18.0-89.4; interquartile range [IQR], 34.4-66.3; mean ± standard deviation [SD]: 50.8 ± 18.5 years). A total of 188 patients (68.3%) achieved either a complete or partial response to therapy. Among 230 patients with available karyotype data, 8 (3.5%) had abnormal karyotypes at diagnosis (Table 2). Although most patients were enrolled at a single center, sMN incidence did not differ significantly across institutions (χ² = 1.95; P = .377), reducing concern for treatment-related bias despite differences in enrollment rates.
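
The risk ratio and odds ratio quoted above follow directly from the reported counts (40 of 222 vs 4 of 53). A short sketch of the arithmetic, with hypothetical function names:

```python
def risk_ratio(events_a, total_a, events_b, total_b):
    """Ratio of event proportions between group A and group B."""
    return (events_a / total_a) / (events_b / total_b)


def odds_ratio(events_a, total_a, events_b, total_b):
    """Ratio of event odds between group A and group B."""
    odds_a = events_a / (total_a - events_a)
    odds_b = events_b / (total_b - events_b)
    return odds_a / odds_b


# Group A: SAA/VSAA (40 of 222 with sMN); group B: nonsevere AA (4 of 53).
rr = risk_ratio(40, 222, 4, 53)
orr = odds_ratio(40, 222, 4, 53)
print(round(rr, 2), round(orr, 2))  # 2.39 2.69
```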

PNH clones were detected in 111 patients (40.4%), with a median granulocyte clone size of 1.2% (range, 0.01-93). Of these, 69 patients had clone sizes of ≥1%, and 15 had clone sizes of ≥20%. Among patients who did not develop sMN, 98 (35%) harbored detectable PNH clones (Table 2). Baseline somatic mutation data were available for 195 patients. Among them, 28 (14.4%) had 1 mutation, and 8 (4.1%) had ≥2 mutations. The most frequently observed mutations were ASXL1 (n = 7 [3.6%]), TET2 (n = 7 [3.6%]), BCOR or BCORL1 (n = 6 [3.1%]), CUX1 (n = 5 [2.6%]), RUNX1 (n = 5 [2.6%]), and DNMT3A (n = 4 [2.1%]; supplemental Materials; Table 2).

Model 1 was trained using 23 baseline clinical and molecular features (Table 2; supplemental Figure 1). The 5 most predictive variables for sMN development were CUX1 mutation, DNMT3A mutation, TET2 mutation, total mutation count, and patient age at diagnosis (Figure 1A-B). Model 1 achieved an AUC of 0.82, with a sensitivity of 0.81, specificity of 0.74, positive predictive value (PPV) of 35.0%, and negative predictive value of 95.5% (Figure 1C). The confusion matrix (Figure 1D) revealed 36 true positives (13.0%), 171 true negatives (62.0%), 61 false positives (22.1%), and 8 false negatives (2.9%). The receiver operating characteristic (ROC) curve demonstrated strong discriminatory ability (Figure 1E).

Figure 1.

Feature impact scores, performance metrics, and validation of machine learning model 1 for predicting sMNs. (A) Bar plot of the feature impact scores for model 1, demonstrating the relative contribution of each feature to the model’s performance. (B) Multivariate regression (R) scores and P values of variables used in model 1, with asterisks (∗) denoting statistical significance (P < .05). (C) Summary of model 1 performance metrics, including sensitivity (0.81), specificity (0.74), and an AUC of 0.82. (D) Confusion matrix displaying true positives (36 [13.0%]), true negatives (171 [62.0%]), false positives (61 [22.1%]), and false negatives (8 [2.9%]). (E) AUC-ROC curve illustrating the model’s predictive performance with an AUC of 0.82, indicating strong discriminatory ability for identifying sMN risk.

Model 2, which incorporated 29 features including 6-month treatment response, identified DNMT3A mutation, age at diagnosis, PNH clone presence, CUX1 mutation, and total mutation count as the top predictors (Figures 2A-B and 3C-D). Model 2 also achieved an AUC of 0.82, with a sensitivity of 0.84, specificity of 0.73, PPV of 36.7%, and negative predictive value of 95.5% (Figure 2C). The confusion matrix included 36 true positives (13.0%), 169 true negatives (61.5%), 62 false positives (22.5%), and 8 false negatives (2.9%; Figure 2D). ROC analysis confirmed strong performance (Figure 2E).
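
For readers who wish to relate the confusion matrix to the summary metrics, the standard definitions are sketched below using the model 2 counts. The function name is hypothetical, and recomputed values may differ slightly from the reported summary figures, for example because of rounding or threshold handling.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Counts reported for model 2.
metrics = confusion_metrics(tp=36, tn=169, fp=62, fn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```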

Figure 2.

Feature impact scores, performance metrics, and validation of machine learning model 2 for predicting sMNs. (A) Bar plot of the feature impact scores for model 2, demonstrating the relative contribution of each feature to the model’s performance. (B) Multivariate regression (R) scores and P values of variables used in model 2, with asterisks denoting statistical significance (P < .05). (C) Summary of model 2 performance metrics, including sensitivity (0.84), specificity (0.73), and an AUC of 0.82. (D) Confusion matrix displaying true positives (36 [13.0%]), true negatives (169 [61.5%]), false positives (62 [22.5%]), and false negatives (8 [2.9%]). (E) AUC-ROC curve illustrating the model’s predictive performance with an AUC of 0.82, indicating strong discriminatory ability for identifying sMN risk. ATG, antithymocyte globulin.

DeepSHAP was applied to obtain both local and global explanations for the neural network models. Local contributions are depicted in beeswarm summary plots (Figure 3A,C), which map individual SHAP (Shapley additive explanations) values for each feature, colored by feature magnitude, to indicate how varying feature values influence risk predictions. Global importance is presented via mean absolute SHAP value bar plots with hierarchical clustering (Figure 3B,D), in which features are sorted by average contribution magnitude and clusters are delineated at a linkage distance cutoff of 0.50 to group highly redundant features for visualization. In model 1, mean absolute SHAP values for CUX1 mutation (0.99), age at diagnosis (0.77), and PNH clone size (0.50; binary presence of >0% at 0.46) exceeded the clustering threshold, whereas all remaining features displayed lower mean values. In model 2, mean values for age (0.78), DNMT3A mutation (0.70), U2AF1 mutation (0.67), and PNH clone size (0.54) were above the threshold, followed by treatment-response and additional genomic variables forming a secondary cluster; features outside these clusters exhibited lower mean absolute SHAP values. The clustering cutoff serves solely as a visualization aid and does not imply statistical testing. The observed difference in SHAP values across models reflects a shift in feature contribution due to the inclusion of dynamic variables, rather than a reversal of predictive direction or a fundamental mechanistic change.
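
Global importance as a mean absolute SHAP value can be computed from a matrix of per-patient SHAP values as sketched below. The SHAP values in the usage lines are made-up toy numbers chosen for illustration, not values from this study, and the function name is hypothetical.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value across patients.

    shap_rows: one list of per-feature SHAP values per patient.
    Returns (feature, mean |SHAP|) pairs sorted from most to least important.
    """
    n = len(shap_rows)
    means = {
        name: sum(abs(row[k]) for row in shap_rows) / n
        for k, name in enumerate(feature_names)
    }
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


# Toy values for 3 hypothetical patients.
toy_names = ["age", "CUX1", "PNH"]
toy_shap = [
    [0.5, -1.0, -0.2],
    [-0.7, 0.0, 0.4],
    [0.9, 2.0, -0.3],
]
ranking = mean_abs_shap(toy_shap, toy_names)
print(ranking)
```

Taking the absolute value before averaging ranks a feature by how strongly it moves predictions in either direction, so a protective feature and a risk-raising feature of equal magnitude rank equally; the sign is read from the beeswarm plot instead.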

Figure 3.

SHAP analysis of feature contributions in machine learning models predicting sMN development in AA. SHAP summary (beeswarm) plots for model 1 (A) and model 2 (C). Each point represents the SHAP value of a feature for an individual observation. Features are ranked by overall importance. The x-axis indicates the SHAP value (ie, the impact of that feature on the model’s output for a given patient), whereas color reflects the feature value (blue = low, red = high). Features to the right of x = 0 were associated with increased predicted risk of sMN, whereas features to the left were associated with reduced risk. Mean absolute SHAP value plots with hierarchical clustering of features for model 1 (B) and model 2 (D). These bar plots quantify the global importance of each feature based on the magnitude of its contribution to model predictions. Features with low mean SHAP values were considered to have minimal impact and were excluded from subsequent clustering. “Remaining 3 features” in panel D refers to “Abnormal Karyotype,” “Del(13q) Karyotype,” and “Other IST Treatment,” which were retained but contributed minimally. ATG, antithymocyte globulin.

Kaplan-Meier analysis demonstrated significantly reduced sMN-free survival among high-risk patients as stratified by both models (supplemental Figure 3A-B). The log-rank test statistic was 37.56 (P = 8.86 × 10⁻¹⁰) for model 1 and 43.53 (P = 4.17 × 10⁻¹¹) for model 2. Competing risks analysis showed a cumulative sMN incidence of 4.92% at 2 years, 21.74% at 5 years, 57.38% at 10 years, and ∼65% at 15 years (supplemental Figure 2A-B). Stratification by model-defined risk groups confirmed markedly increased incidence in high-risk patients. At 10 and 15 years, high-risk patients had cumulative incidences of 58.2% and 67.5%, respectively, by model 1, and 54.6% and 63.4%, respectively, by model 2, compared with 15.2% and 17.5% (model 1) and 11.7% and 13.4% (model 2), respectively, in the low-risk group (Gray test χ² = 35.50 for model 1; and χ² = 40.14 for model 2; P < .001 for both). Correlation and covariance matrices for all features used in model training are provided in supplemental Figure 4A-D.
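
The product-limit (Kaplan-Meier) estimator underlying the sMN-free survival curves can be sketched in plain Python. The follow-up times and event flags in the usage lines are toy data, not patient data, and the function name is hypothetical. Note that this estimator treats all non-events as censored, unlike the competing-risks cumulative incidence also reported above.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times: follow-up durations; events: 1 if the event occurred, 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_removed
    return curve


# Toy data: 5 subjects, follow-up in years.
km_curve = kaplan_meier(times=[1, 2, 2, 3, 4], events=[1, 0, 1, 0, 1])
print(km_curve)
```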

A total of 44 patients (16.0%) developed sMNs: 38 with MDS and 6 with AML. The median time from AA diagnosis to sMN onset was 3.93 years (IQR, 2.06-6.63; mean ± SD: 4.50 ± 3.25 years; range, 0.13-14.86). Median latency to AML onset was 2.96 years (IQR, 1.70-5.04), which did not differ significantly from the median latency to MDS onset of 4.44 years (IQR, 2.21-7.03; Mann-Whitney U test, P = .188).

The median follow-up for the overall cohort was 4.19 years (IQR, 7.78; mean ± SD: 6.58 ± 6.61 years). Median follow-up durations by subgroup were: 2.37 years (IQR, 1.45-7.27) for patients who remained alive, 3.93 years (IQR, 2.06-6.63) for patients who developed sMN, and 3.60 years (IQR, 1.75-6.02) for patients who underwent HSCT. Missing follow-up data were noted in 6.91% of all patients, including 5.96% of patients without sMN and 12.5% of those who developed sMN.

Mann-Whitney U testing showed that patients who died had significantly shorter follow-up compared with those who remained alive (P = .0016), developed sMN (P = .023), or underwent HSCT (P = .025). There were no significant differences in follow-up time between patients who developed sMN and underwent transplant (P = .978), developed sMN and remained alive (P = .467), or underwent transplant and remained alive (P = .490).

In this study, we developed 2 binary classification machine learning models to identify adult patients with acquired AA at high risk for developing sMN using routinely collected clinical and molecular features.

Model 1 was trained using baseline diagnostic features and achieved an AUC of 0.82. The top predictive variables included CUX1 mutation, DNMT3A mutation, TET2 mutation, PNH clone presence, total somatic mutation count, and patient age at diagnosis, features broadly consistent with previous literature implicating clonal hematopoiesis and epigenetic dysregulation in leukemic evolution.2,13,15,36,37 Although CUX1 emerged as a highly predictive feature in our model, it is considered a less frequent mutation in comparison with more commonly reported high-risk mutations such as ASXL1, RUNX1, and SETBP1. Previous studies more suited for pathophysiologic interpretation suggest that CUX1 mutations may represent early, unstable events that disappear at transformation, often supplanted by −7/del(7q).12 In MNs, CUX1 mutations are frequently subclonal, enriched in older patients, co-occur with adverse cytogenetics and high mutational burden, and are often underrecognized in standard panels due to low VAF.38 It should be noted that our study design does not allow mechanistic inference on clonal evolution.

Model 2 incorporated treatment response data from 6 months after diagnosis to reassess sMN risk dynamically. This model achieved similar performance (AUC of 0.82) and relied on many of the same top predictors: DNMT3A, CUX1, total mutation count, age, and PNH clone presence. The positive association of driver mutations and age, as well as the protective effect of PNH against sMN development, are consistent with previous studies.11,12,14,36 Importantly, this time point captures a clinically relevant juncture in AA management; ∼30% of patients do not respond to horse antithymocyte globulin–based IST, and among these, ∼15% progress to MDS or AML. In practice, not all nonresponders proceed directly to transplant due to factors such as donor availability, comorbidities, or institutional policy. Model 2 may help stratify these patients by sMN risk and facilitate more informed decisions regarding expedited transplant referral or intensified surveillance, even beyond response categorization alone.18 

To the best of our knowledge, this effort represents the first validated machine learning–based approach for individualized risk prediction of sMN in acquired AA. Notably, Yoshizato et al used penalized variable selection and random survival forests for feature selection in a cohort of 256 National Institutes of Health patients to identify gene combinations associated with IST response, overall survival, and PFS.14 Although their analysis revealed that BCOR and BCORL1 mutations conferred favorable outcomes and DNMT3A, ASXL1, RUNX1, JAK2, and JAK3 conferred worse PFS (P < .03), it was not intended for patient-specific risk estimation. In contrast, our methodology was explicitly developed to generate individualized outcome predictions rather than infer mechanistic insight.

Although AA-associated sMN is often described as a long-term complication, recent studies reveal a persistently elevated and nonplateauing risk over time. In our cohort, 44 patients (16%) developed sMN; 36 patients within 5 years, and 10 within 2 years. The cumulative incidence of sMN reached 21.7% at 5 years and 57.4% at 10 years. Risk was notably higher with age; previous studies have reported 10-year cumulative incidences of 20.6% for patients aged >35 years vs 6.6% for those aged 15 to 35 years at diagnosis.12 Our cohort was consistent in terms of sMN incidence rates and median age to onset from initial diagnosis, which underscores the need for predictive tools to enable proactive intervention.12,13,39 

Currently, allogeneic HSCT remains the only curative treatment for both AA and its clonal complications. Long-term follow-up studies suggest that MDS/AML risk after transplantation is negligible.40 Early transplantation, ideally within 6 months of AA diagnosis, has been associated with improved clinical outcomes.18,41 In matched related donor transplants, delays beyond 6 months have been linked to significantly increased risks of graft-versus-host disease and relapse-free survival failure (hazard ratio, 4.08; 95% confidence interval, 1.41-11.83; P = .010). Transplantation after overt progression to MDS or AML introduces additional risks related to chemotherapy toxicity, relapse, and poorer survival outcomes. Five-year overall survival after transplantation for post-AA MDS or AML has been reported at ∼62%, compared with ∼23% with chemotherapy or supportive care alone (P < .01).13 Despite its curative potential, transplantation carries substantial risks, and graft-versus-host disease and relapse-free survival rates in adults remain limited. Thus, individualized risk prediction remains critically important for guiding decisions regarding transplant timing and patient selection.

Although our study demonstrates the feasibility of using machine learning to forecast sMN risk in acquired AA, it is constrained by data limitations. The modest sample size limits generalizability and precludes robust stratification by age, disease severity, or donor availability. Although both models achieved strong discriminative performance (AUC of ∼0.82), their PPVs were modest (32%-36%), consistent with the relatively low long-term incidence of sMN (∼15%-20%). Unlike AUC, which is invariant to outcome prevalence,42-44 PPV is directly influenced by disease incidence and is expected to improve with model training on larger data sets. We prioritized standard performance metrics such as AUC, ROC curves, sensitivity, and specificity to assess the models’ ability to discriminate between patients who did and did not develop sMN, because these metrics provide prevalence-independent evaluations of model performance.45,46 Kaplan-Meier and cumulative incidence function analyses (supplemental Figures 2A-B and 3A-B) demonstrated significant early divergence in sMN incidence between model-defined high- and low-risk groups. For both model 1 and model 2, log-rank tests showed strong statistical significance (P < .001), supporting the potential clinical utility of these models in identifying patients at elevated risk of malignant transformation early in the disease course.
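
The dependence of PPV on outcome prevalence follows from Bayes’ rule and can be illustrated numerically. The sketch below uses the sensitivity and specificity reported in the abstract; the 30% prevalence is a hypothetical comparison value, not a figure from this study, and the function name is an assumption.

```python
def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Prevalence of 16% matches the cohort's sMN incidence.
observed = ppv_from_prevalence(0.82, 0.73, 0.16)
# Hypothetical higher-prevalence setting for comparison.
hypothetical = ppv_from_prevalence(0.82, 0.73, 0.30)
print(round(observed, 2), round(hypothetical, 2))  # 0.37 0.57
```

With sensitivity and specificity held fixed, the PPV rises with prevalence, which is why a modest PPV is expected when only about 1 in 6 patients develops sMN.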

Another key limitation is that the models are difficult to interpret, owing to the “black box” nature of deep learning algorithms.47 We addressed this by reporting SHAP values (Figure 3), regression scores (Figures 1B and 2B), multivariate hazard ratios (Table 3), and correlation and covariance matrices (supplemental Figure 4A-B), so that the effect of each feature could be assessed with several complementary methods.48,49
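For readers unfamiliar with Shapley-based attribution, the idea can be shown on a toy example. The sketch below computes exact Shapley values by brute force over all feature orderings for a hypothetical 3-feature risk function; the function `toy_risk`, its coefficients, and the feature names are assumptions made purely for illustration and do not represent the study's models, which relied on SHAP estimates for trained neural networks.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution of each feature
    over every ordering in which features are switched from the baseline
    value to the instance value."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = model(current)
        for idx in order:
            current[idx] = instance[idx]
            value = model(current)
            totals[idx] += value - previous
            previous = value
    return [t / len(orderings) for t in totals]


def toy_risk(features):
    """Hypothetical risk score: additive terms plus one interaction."""
    has_mutation, older_age, other_marker = features
    return (0.10 + 0.30 * has_mutation + 0.10 * older_age
            + 0.05 * other_marker + 0.20 * has_mutation * older_age)


instance = [1, 1, 0]   # hypothetical patient
baseline = [0, 0, 0]   # hypothetical reference patient
phi = shapley_values(toy_risk, instance, baseline)

for name, value in zip(("has_mutation", "older_age", "other_marker"), phi):
    print(f"{name}: {value:+.2f}")
```

The attributions sum to the difference between the score for the instance and the score for the baseline, and the interaction term is split equally between the 2 interacting features. Brute-force enumeration is feasible only for a handful of features, which is why approximations are used in practice.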

The data set used to train model 1 did not include data on immunosuppressive treatments, which may introduce bias into the model due to implicit correlations between treatment patterns and sMN risk. Model 2 is not subject to this bias, and the similar performance of the 2 models suggests that its effect on model 1 is minimal. We used LOO crossvalidation to minimize bias and fully leverage a multicenter data set,50-52 yet this internal validation approach cannot replace testing in an independent cohort. Given the rarity of AA, many studies face similar constraints in assembling large, external data sets. Nonetheless, external validation will be a critical next step in the development of clinically robust and generalizable prediction models.
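The LOO procedure referred to above can be sketched in a few lines. This is a minimal illustration on synthetic data, assuming a nearest-centroid scorer as a simple stand-in for the neural networks; all function names, the cohort size, and the feature distributions are hypothetical and are not drawn from the study.

```python
import random


def nearest_centroid_score(train_x, train_y, x):
    """Stand-in model: distance to the negative-class centroid minus
    distance to the positive-class centroid (higher = more likely positive)."""
    def centroid(label):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    return dist(x, centroid(0)) - dist(x, centroid(1))


def leave_one_out_scores(xs, ys, score_fn):
    """Hold out each patient once, fit on the rest, score the held-out one."""
    scores = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        scores.append(score_fn(train_x, train_y, xs[i]))
    return scores


def auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative
    case, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic cohort (hypothetical): 10 events among 50 patients, 2 features.
rng = random.Random(0)
labels = [1] * 10 + [0] * 40
features = [[rng.gauss(1.0 if y else 0.0, 1.0),
             rng.gauss(0.5 if y else 0.0, 1.0)] for y in labels]

loo_scores = leave_one_out_scores(features, labels, nearest_centroid_score)
print(f"LOO AUC on synthetic data: {auc(loo_scores, labels):.2f}")
```

Every patient contributes exactly 1 held-out prediction, so the pooled AUC uses the entire cohort for evaluation. All predictions still come from the same source population, which is why this form of internal validation does not substitute for an independent cohort.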

In summary, our study provides proof of concept for machine learning–based risk prediction of sMN in acquired AA using clinically accessible variables. Future studies should focus on training with larger data sets and on external validation. With refinement, such models may eventually support risk stratification, inform transplant decisions, and facilitate personalized care. However, until prospective studies, including randomized trials, demonstrate added clinical utility beyond physician judgment, we cannot recommend how such tools should be used in practice.

The authors thank the University of Texas Southwestern Medical Center, the Cleveland Clinic, and the University of Pennsylvania for their contributions to this research. The authors thank their research coordinator, Kasia Harrah, for her valuable contributions to this study.

Contribution: A.C.T. contributed to conceptualization, methodology, coding, validation, statistical analysis, investigation, data curation, writing the manuscript, visualization, project administration, and funding acquisition; J.J.S. contributed to writing the manuscript, visualization, and statistical analysis; M.M. contributed to methodology, software, validation, formal analysis, resources, writing the manuscript, and visualization; C.I.A.S. contributed to writing the manuscript and statistical analysis; C.G. contributed to validation, investigation, resources, data curation, writing the manuscript, supervision, project administration, and funding acquisition; J.P.M. contributed to validation, investigation, resources, data curation, writing the manuscript, supervision, project administration, and funding acquisition; D.V.B. contributed to resources, investigation, and writing the manuscript; Z.B. contributed to investigation and writing the manuscript; Z.T., M.N.D., H.A., and I.I. contributed to writing the manuscript; Y.O. and L.G. contributed to visualization and writing the manuscript; T.B. contributed to conceptualization, methodology, validation, investigation, resources, writing the manuscript, visualization, supervision, project administration, and funding acquisition; A.O. contributed to data visualization; and J.A.T. contributed to writing the manuscript.

Conflict-of-interest disclosure: T.B. reports advisory committee/board membership with Alexion, Novartis, Samsung Bioepis, Omeros, and Recordati Rare Disease. J.P.M. reports honoraria from, and speakers bureau role with, Novartis; advisory committee/board membership with Alexion; consultancy with, and honoraria from, Regeneron; and consultancy with Omeros. D.V.B. reports consultancy with Retro Biosciences. The remaining authors declare no competing financial interests.

Correspondence: Taha Bat, Division of Hematology-Oncology, Department of Internal Medicine, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390-9255; email: taha.bat@utsouthwestern.edu.

1. Young NS, Maciejewski J. The pathophysiology of acquired aplastic anemia. N Engl J Med. 1997;336(19):1365-1372.
2. Sun L, Babushok DV. Secondary myelodysplastic syndrome and leukemia in acquired aplastic anemia and paroxysmal nocturnal hemoglobinuria. Blood. 2020;136(1):36-49.
3. de Planque MM, Bacigalupo A, Würsch A, et al. Long-term follow-up of severe aplastic anaemia patients treated with antithymocyte globulin. Br J Haematol. 1989;73(1):121-126.
4. Rosenfeld S, Follmann D, Nunez O, Young NS. Antithymocyte globulin and cyclosporine for severe aplastic anemia: association between hematologic response and long-term outcome. JAMA. 2003;289(9):1130-1135.
5. Frickhofen N, Heimpel H, Kaltwasser JP, Schrezenmeier H; German Aplastic Anemia Study Group. Antithymocyte globulin with or without cyclosporin A: 11-year follow-up of a randomized trial comparing treatments of aplastic anemia. Blood. 2003;101(4):1236-1242.
6. Bacigalupo A. How I treat acquired aplastic anemia. Blood. 2017;129(11):1428-1436.
7. Rice C, Eikema D-J, Marsh JCW, et al. Allogeneic hematopoietic cell transplantation in patients aged 50 years or older with severe aplastic anemia. Biol Blood Marrow Transplant. 2019;25(3):488-495.
8. Sureda A, Bacigalupo A, Boogaerts M, et al. The EBMT Handbook: Hematopoietic Stem Cell Transplantation and Cellular Therapies. 7th ed. Springer; 2019.
9. Iftikhar R, DeFilipp Z, DeZern AE, et al. Allogeneic hematopoietic cell transplantation for the treatment of severe aplastic anemia: evidence-based guidelines from the American Society for Transplantation and Cellular Therapy. Transplant Cell Ther. 2024;30(12):1155-1170.
10. Wirk B. Acquired aplastic anemia therapies: immunosuppressive therapy versus alternative donor hematopoietic cell transplantation. J Hematol. 2024;13(3):61-70.
11. Negoro E, Nagata Y, Clemente MJ, et al. Origins of myelodysplastic syndromes after aplastic anemia. Blood. 2017;130(17):1953-1957.
12. Gurnari C, Pagliuca S, Prata PH, et al. Clinical and molecular determinants of clonal evolution in aplastic anemia and paroxysmal nocturnal hemoglobinuria. J Clin Oncol. 2023;41(1):132-142.
13. Groarke EM, Patel BA, Shalhoub R, et al. Predictors of clonal evolution and myeloid neoplasia following immunosuppressive therapy in severe aplastic anemia. Leukemia. 2022;36(9):2328-2337.
14. Yoshizato T, Dumitriu B, Hosokawa K, et al. Somatic mutations and clonal hematopoiesis in aplastic anemia. N Engl J Med. 2015;373(1):35-47.
15. Babushok DV, Perdigones N, Perin JC, et al. Emergence of clonal hematopoiesis in the majority of patients with acquired aplastic anemia. Cancer Genet. 2015;208(4):115-128.
16. Kulasekararaj AG, Jiang J, Smith AE, et al. Somatic mutations identify a subgroup of aplastic anemia patients who progress to myelodysplastic syndrome. Blood. 2014;124(17):2698-2704.
17. Zaimoku Y, Takamatsu H, Hosomichi K, et al. Identification of an HLA class I allele closely involved in the autoantigen presentation in acquired aplastic anemia. Blood. 2017;129(21):2908-2916.
18. Kulasekararaj A, Cavenagh J, Dokal I, et al. Guidelines for the diagnosis and management of adult aplastic anaemia: a British Society for Haematology Guideline. Br J Haematol. 2024;204(3):784-804.
19. Camitta BM, Rappeport JM, Parkman R, Nathan DG. Selection of patients for bone marrow transplantation in severe aplastic anemia. Blood. 1975;45(3):355-363.
20. Scheinberg P, Nunez O, Weinstein B, et al. Horse versus rabbit antithymocyte globulin in acquired aplastic anemia. N Engl J Med. 2011;365:430-438.
21. Scheinberg P, Young NS. How I treat acquired aplastic anemia. Blood. 2012;120(6):1185-1196.
22. Hosokawa K, Katagiri T, Sugimori N, et al. Favorable outcome of patients who have 13q deletion: a suggestion for revision of the WHO 'MDS-U' designation. Haematologica. 2012;97(12):1845-1849.
23. Ishiyama K, Karasawa M, Miyawaki S, et al. Aplastic anaemia with 13q-: a benign subset of bone marrow failure responsive to immunosuppressive therapy. Br J Haematol. 2002;117(3):747-750.
24. Maciejewski JP, Risitano A, Sloand EM, Nunez O, Young NS. Distinct clinical outcomes for cytogenetic abnormalities evolving from aplastic anemia. Blood. 2002;99(9):3129-3135.
25. Holbro A, Jotterand M, Passweg JR, Buser A, Tichelli A, Rovó A. Comment to “favorable outcome of patients who have 13q deletion: a suggestion for revision of the WHO ‘MDS-U’ designation”. Haematologica. 2013;98(4):1845-1849.
26. Litzow MR, Kyle RA. Multiple responses of aplastic anemia to low-dose cyclosporine therapy despite development of a myelodysplastic syndrome. Am J Hematol. 1989;32(3):226-229.
27. Kopanos C, Tsiolkas V, Kouris A, et al. VarSome: the human genomic variant search engine. Bioinformatics. 2019;35(11):1978-1980.
28. Barrabés M, Perera M, Novelle Moriano V, Giró-I-Nieto X, Mas Montserrat D, Ioannidis AG. Advances in biomedical missing data imputation: a survey. IEEE Access. 2025;13:16918-16932.
29. Rahman MM, Davis DN. Machine learning-based missing value imputation method for clinical datasets. In: Yang G-C, Ao S-l, Gelman L, eds. IAENG Transactions on Engineering Technologies. Springer; 2013:245-257.
30. Phung S, Kumar A, Kim J. A deep learning technique for imputing missing healthcare data. 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC); 2019:6513-6516.
31. van Buuren S, Groothuis-Oudshoorn K. mice: multivariate imputation by chained equations in R. J Stat Softw. 2011;45(3):1-67.
32. Mitchell TM. Machine Learning. McGraw-Hill; 1997.
33. Breiman L. Random forests. Machine Learning. 2001;45(1):5-32.
34. Berrar D. Cross-validation. Encyclopedia of Bioinformatics and Computational Biology. 2019.
35. Abraham A, Pedregosa F, Eickenberg M, et al. Machine learning for neuroimaging with scikit-learn. Frontiers in Neuroinformatics. 2014;8.
36. Nagata Y, Makishima H, Kerr CM, et al. Invariant patterns of clonal succession determine specific clinical features of myelodysplastic syndromes. Nat Commun. 2019;10(1):5386.
37. Babushok DV. A brief, but comprehensive, guide to clonal evolution in aplastic anemia. Hematology Am Soc Hematol Educ Program. 2018;2018(1):457-466.
38. Dermawan JK, Wensel C, Visconte V, Maciejewski JP, Cook JR, Bosler DS. Clinically significant CUX1 mutations are frequently subclonal and common in myeloid disorders with a high number of co-mutated genes and dysplastic features. Am J Clin Pathol. 2022;157(4):586-594.
39. Li Y, Li X, Ge M, et al. Long-term follow-up of clonal evolutions in 802 aplastic anemia patients: a single-center experience. Ann Hematol. 2011;90(5):529-537.
40. Gurnari C, Pagliuca S, Kewan T, et al. Is nature truly healing itself? Spontaneous remissions in paroxysmal nocturnal hemoglobinuria. Blood Cancer J. 2021;11(11):187.
41. Killick SB, Bown N, Cavenagh J, et al. Guidelines for the diagnosis and management of adult aplastic anaemia. Br J Haematol. 2016;172(2):187-207.
42. Schaefer J, Lehne M, Schepers J, Prasser F, Thun S. The use of machine learning in rare diseases: a scoping review. Orphanet J Rare Dis. 2020;15(1):145.
43. Varoquaux G, Colliot O. Evaluating machine learning models and their diagnostic value. Neuromethods. 2023.
44. Vidyasagar M. Identifying predictive features in drug response using machine learning: opportunities and challenges. Annu Rev Pharmacol Toxicol. 2015;55:15-34.
45. Shapiro DE. The interpretation of diagnostic tests. Stat Methods Med Res. 1999;8(2):113-134.
46. Monaghan TF, Rahman SN, Agudelo CW, et al. Foundational statistical principles in medical research: sensitivity, specificity, positive predictive value, and negative predictive value. Medicina. 2021;57(5):503.
47. Dobson JE. On reading and interpreting black box deep neural networks. Int J Digit Humanit. 2023;5(2):431-449.
48. Louhichi M, Nesmaoui R, Mbarek M, Lazaar M. Shapley values for explaining the black box nature of machine learning model clustering. Procedia Computer Science. 2023;220:806-811.
49. Rodríguez-Pérez R, Bajorath J. Interpretation of machine learning models using Shapley values: application to compound potency and multi-target activity predictions. J Comput Aided Mol Des. 2020;34(10):1013-1026.
50. Stone M. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. 1974;36(2):111-133.
51. Molinaro AM, Simon R, Pfeiffer RM. Prediction error estimation: a comparison of resampling methods. Bioinformatics. 2005;21(15):3301-3307.
52. Lee C, Kim HN, Kwon JA, et al. Identification of a complex karyotype signature with clinical implications in AML and MDS-EB using gene expression profiling. Cancers (Basel). 2023;15(21):5289.

Author notes

A.C.T. and M.M. are joint first authors.

Deidentified data, study protocol, and source code are available on request from the corresponding author, Taha Bat (taha.bat@utsouthwestern.edu).

The full-text version of this article contains a data supplement.