Key Points
Prognostic capacity varied across 8 allogeneic transplantation scores, with rPAM showing modest benefit across several outcomes.
EASIx, a biomarker-based prediction model, is among the strongest predictive scores of NRM.
Abstract
Clinical decisions in allogeneic hematopoietic stem cell transplantation (allo-HSCT) are supported by the use of prognostic scores for outcome prediction. Scores vary in their features and in the composition of development cohorts. We sought to externally validate and compare the performance of 8 commonly applied scoring systems on a cohort of allo-HSCT recipients. Among 528 patients studied, acute myeloid leukemia was the leading transplant indication (44%) and 46% of patients had a matched sibling donor. Most models successfully grouped patients into higher and lower risk strata, supporting their use for risk classification. However, discrimination varied (2-year overall survival area under the receiver operating characteristic curve [AUC]: revised Pretransplantation Assessment of Mortality [rPAM], 0.64; PAM, 0.63; revised Disease Risk Index [rDRI], 0.62; Endothelial Activation and Stress Index [EASIx], 0.60; combined European Society for Blood and Marrow Transplantation [EBMT]/Hematopoietic Cell Transplantation-specific Comorbidity Index [HCT-CI], 0.58; EBMT, 0.58; Comorbidity-Age, 0.58; HCT-CI, 0.55); AUC ranges from 0.5 (random) to 1.0 (perfect prediction). rPAM and PAM, which had the greatest predictive capacity across all outcomes, are comprehensive models including patient, disease, and transplantation information. Interestingly, EASIx, a biomarker-driven model, had comparable performance for nonrelapse mortality (NRM; 2-year AUC, 0.65) but no predictive value for relapse (2-year AUC, 0.53). Overall, allo-HSCT prognostic systems may be useful for risk stratification, but individual prediction remains a challenge, as reflected by the scores’ limited discriminative capacity.
Introduction
Given the potential benefits and perils associated with allogeneic hematopoietic stem cell transplantation (HSCT), informed risk estimation is an integral part of candidate evaluation. The past 20 years have seen the proliferation of risk indices for the prediction of HSCT outcomes. These models can be useful for patient counseling, treatment strategy optimization, and statistical analysis across cohorts.1,2 The scores draw on different sets of parameters. The Hematopoietic Cell Transplantation–specific Comorbidity Index (HCT-CI)3 and its derivative Comorbidity-Age Index4 are based on the patient’s comorbidity profile. The score of the European Society for Blood and Marrow Transplantation (EBMT)5 includes characteristics of the patient (age), disease (status, time from diagnosis to transplantation), and donor (relation, donor-recipient HLA match, and sex match). The Comorbidity-EBMT index, proposed by Barba et al in 2014, combines the comorbidity-specific information from the HCT-CI with the broader range of values included in the EBMT score.6 In 2006, the Pretransplantation Assessment of Mortality (PAM) score7 was published with some features that overlap with the EBMT score (it also includes information on conditioning and laboratory markers of comorbidities); this score was simplified in 2015 (revised PAM [rPAM]),8 leaving only age, donor type, disease status, and pulmonary function, while adding donor and recipient cytomegalovirus (CMV) serostatus.
Most recently, the Endothelial Activation and Stress Index (EASIx),9 a laboratory biomarker-based formula including serum creatinine, lactate dehydrogenase, and platelet count, was developed for the prediction of survival in patients developing acute graft-versus-host disease; this score has been extended into the general prediction of mortality when measured pretransplantation.10 Another prognostic tool commonly used for risk stratification is the revised Disease Risk Index (rDRI), which incorporates disease type and status at time of transplantation.11 Supplemental Table 1 describes each score’s components. Data regarding the comparative performance of these scores in the same population are lacking. We aimed to externally validate and compare the performance of these 8 systems in a contemporary cohort of transplantation patients across several outcomes.
Methods
Data collection
Clinical and laboratory data prior to transplantation were obtained from the electronic medical record for allogeneic transplantations performed between 2011 and 2015 at Chaim Sheba Medical Center at Tel HaShomer, Ramat Gan, Israel. Outcomes were cross-referenced with the national social security registry for survival of any patients lost to follow-up. We included adult patients (age ≥18 years) who underwent transplantation for any indication and who received grafts from matched sibling (MSD), matched unrelated (MUD; 10 of 10 HLA alleles), or mismatched unrelated (9 of 10 HLA alleles) donors. Patients missing the parameters necessary for calculation of all studied scores were excluded from the analysis, with the following exceptions: 38 patients with missing patient-donor CMV serostatus pair were included in the overall study but excluded from the analysis of the rPAM score; for 17 patients missing time from diagnosis to transplantation, a component of the EBMT score, this value was imputed using the median value of all patients with the same diagnosis and disease stage. The rDRI was inapplicable to 18 patients treated for nonmalignant conditions who were excluded from the assessment of that system. Conditioning regimens were deemed to be myeloablative or reduced-intensity based on the definitions of Bacigalupo et al,12 with treosulfan-based regimens also considered myeloablative.13 Comorbidities were measured by a qualified transplant physician using definitions provided by the HCT-CI.3 Human subject research was approved by the Chaim Sheba Medical Center Institutional Review Board, and all research was performed in accordance with the Declaration of Helsinki.
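The group-median imputation described above for time from diagnosis to transplantation can be sketched as follows. This is a minimal illustration in Python rather than the authors' R code; the function name, field names, and toy records are hypothetical and are not study data.

```python
from collections import defaultdict
from statistics import median

def impute_group_median(records, value_key, group_keys):
    """Return copies of records with missing (None) values replaced by the
    median of the observed values in the same group."""
    observed = defaultdict(list)
    for r in records:
        if r[value_key] is not None:
            observed[tuple(r[k] for k in group_keys)].append(r[value_key])
    imputed = []
    for r in records:
        filled = dict(r)  # copy so the input records are left untouched
        if filled[value_key] is None:
            key = tuple(r[k] for k in group_keys)
            if key in observed:  # stays None if the group has no observed value
                filled[value_key] = median(observed[key])
        imputed.append(filled)
    return imputed

# Hypothetical toy records (not study data)
patients = [
    {"diagnosis": "AML", "stage": "CR1", "days_dx_to_hsct": 120},
    {"diagnosis": "AML", "stage": "CR1", "days_dx_to_hsct": 180},
    {"diagnosis": "AML", "stage": "CR1", "days_dx_to_hsct": None},
    {"diagnosis": "MDS", "stage": "untreated", "days_dx_to_hsct": 400},
]
filled = impute_group_median(patients, "days_dx_to_hsct", ["diagnosis", "stage"])
print(filled[2]["days_dx_to_hsct"])  # median of 120 and 180
```

Grouping by both diagnosis and disease stage keeps the imputed value close to what is typical for comparable patients, at the cost of leaving the value missing when a group has no observed data.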
Statistical analysis
All outcomes were measured from the time of HSCT. Nonrelapse mortality (NRM) was defined as death without the competing event of relapse following HSCT. Time of relapse was determined by a clinical finding of recurrent disease.
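To illustrate how NRM is estimated when relapse is a competing event, the sketch below computes a cumulative incidence with the standard Aalen-Johansen-type estimator. It is a simplified Python illustration, not the "cmprsk" code used in the study; the event coding (0, 1, 2) and the toy data are assumptions made for the example.

```python
def cumulative_incidence(times, events, cause, horizon):
    """Cumulative incidence of `cause` by `horizon` in the presence of competing events.

    events (hypothetical coding): 0 = censored, 1 = relapse,
    2 = death without relapse (NRM).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    event_free = 1.0  # probability of having had no event of any type so far
    cif = 0.0
    i = 0
    while i < len(data) and data[i][0] <= horizon:
        t = data[i][0]
        tied = [e for (u, e) in data if u == t]
        d_cause = sum(1 for e in tied if e == cause)
        d_any = sum(1 for e in tied if e != 0)
        cif += event_free * d_cause / at_risk   # contribution of this time point
        event_free *= 1 - d_any / at_risk       # update all-cause event-free survival
        at_risk -= len(tied)
        i += len(tied)
    return cif

# Hypothetical toy data (years from HSCT, event code)
times = [1, 2, 3, 4, 5]
events = [2, 1, 0, 2, 0]
nrm = cumulative_incidence(times, events, cause=2, horizon=5)
relapse = cumulative_incidence(times, events, cause=1, horizon=5)
print(round(nrm, 3), round(relapse, 3))
```

Treating relapse as simple censoring and taking 1 minus the Kaplan-Meier estimate would give 0.6 for NRM in this toy example instead of 0.5, which is why a competing-risk estimator is preferred.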
Prognostic scores were calculated for each patient using the definitions provided in the publications of these prognostic indices: the HCT-CI,3 the Comorbidity-Age Index (Comorbidity-Age),4 the combined HCT-CI and EBMT (Comorbidity-EBMT),6 the EASIx,9 the PAM7 and rPAM8 scores, and the EBMT5 score (supplemental Table 1). For determination of disease status in calculation of the PAM score, we used the definition provided by the EBMT score similar to the example set by Barba et al.14 Disease status in the rPAM score was determined using the categories of the rDRI,11 with nonmalignant diagnoses assigned an intermediate disease stage, as specified in the rPAM’s initial publication by Au et al.8 Scores were tested for normal distribution in the population using the Shapiro-Wilk test, and Pearson product-moment correlation coefficients were calculated between each pair of scores. For a subanalysis including only patients who received transplants for acute leukemia, an additional score, the Acute Leukemia–EBMT (AL-EBMT; developed using a machine learning technique), was calculated as well, following the schema outlined by Shouval et al.15
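As an example of score calculation, the sketch below computes EASIx, assuming the commonly cited definition (LDH in U/L multiplied by creatinine in mg/dL, divided by platelets in 10⁹/L); the formula is not spelled out in this article, so it should be checked against the original EASIx publication. The strata use the cut points shown in Table 2, and the handling of values that fall exactly on a cut point is an assumption.

```python
def easix(ldh_u_per_l, creatinine_mg_per_dl, platelets_1e9_per_l):
    """EASIx, assuming the definition LDH x creatinine / platelets."""
    if platelets_1e9_per_l <= 0:
        raise ValueError("platelet count must be positive")
    return ldh_u_per_l * creatinine_mg_per_dl / platelets_1e9_per_l

def easix_stratum(value):
    """Map an EASIx value to the strata of Table 2.
    Assignment of values exactly on a cut point is an assumption."""
    if value < 0.89:
        return "<0.89"
    if value < 1.40:
        return "0.89-1.40"
    if value <= 3.76:
        return "1.40-3.76"
    return ">3.76"

# Illustration using the cohort medians from Table 1 (LDH 208, creatinine 0.85,
# platelets 132); the EASIx of the medians is not the median EASIx.
value = easix(208, 0.85, 132)
print(round(value, 2), easix_stratum(value))  # 1.34 0.89-1.40
```

Because platelets sit in the denominator, severe thrombocytopenia produces very large values, which is consistent with the long right tail described for this score.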
The scores were grouped into 3 to 6 levels each for estimating overall survival (OS), NRM, and relapse incidence using the Kaplan-Meier and cumulative incidence methods and compared using the log-rank and Gray tests. Multivariable regression, adjusted for age, donor type, conditioning intensity, and year, was performed only for the HCT-CI and EASIx scores because these covariates were themselves components of the remaining scores. Additionally, a multivariable model was built separately for the rDRI, adjusting for the same covariates other than disease risk.
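The Kaplan-Meier estimate used for OS within each score stratum can be sketched as below. This is a minimal Python illustration of the product-limit estimator, not the "survival" package code used in the study; the function name and the toy follow-up data are hypothetical.

```python
def kaplan_meier(times, died, horizon):
    """Kaplan-Meier survival probability at `horizon`.

    died[i] is True if death was observed at times[i], False if the
    patient was censored (alive at last follow-up).
    """
    data = sorted(zip(times, died))
    at_risk = len(data)
    surv = 1.0
    i = 0
    while i < len(data) and data[i][0] <= horizon:
        t = data[i][0]
        tied = [d for (u, d) in data if u == t]
        deaths = sum(tied)
        surv *= 1 - deaths / at_risk  # conditional survival past time t
        at_risk -= len(tied)          # deaths and censored patients leave the risk set
        i += len(tied)
    return surv

# Hypothetical stratum of five patients (years from HSCT)
times = [0.5, 1.0, 1.5, 2.5, 3.0]
died = [True, False, True, False, False]
os_2y = kaplan_meier(times, died, horizon=2.0)
print(round(os_2y, 3))  # (4/5) * (2/3)
```

Applying the function separately to each score level gives the stratum-specific 2-year OS values that the log-rank test then compares.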
Score discrimination was measured using the area under the receiver operating characteristic curve (AUC). Discrimination reflects the ability of a prediction model to differentiate between those who do and do not experience the studied outcome. Perfect discrimination corresponds with an AUC of 1.0, meaning that the predicted risk for all individuals who developed the outcome is higher than that for all individuals who did not experience the outcome. An AUC of 0.5 is indicative of a random predictor, that is, a coin toss.16 AUCs were calculated across the entire cohort for the prediction of OS, NRM, and relapse incidence at 100-day and 1-, 2-, and 3-year time points in each of the scores independently. AUCs were further validated using a bootstrapping technique with 100 samples, with median and interquartile range (IQR) reported in the supplemental Appendix. Calibration, the agreement between prediction and observed outcome, was assessed graphically by plotting predicted vs observed outcomes for score quartiles at the 100-day and 1- and 2-year time points. A model is considered well calibrated if, for example, among a group of 100 patients with a mean predicted risk of 20%, ∼20 patients develop the outcome.17 Finally, within a variety of subpopulations, discrimination was assessed to determine whether each score performed better or worse under a given set of conditions.
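The definition of the AUC given above, and its bootstrap validation, can be illustrated with a short sketch. It assumes a simplified setting in which every patient's outcome status at the time point is known (no censoring), which is not how time-dependent AUCs are computed in practice; function names, the seed, and the toy data are hypothetical.

```python
import random
from statistics import median, quantiles

def auc(scores, outcomes):
    """Probability that a patient with the outcome has a higher score than
    a patient without it; ties count as one half."""
    pos = [s for s, y in zip(scores, outcomes) if y]
    neg = [s for s, y in zip(scores, outcomes) if not y]
    if not pos or not neg:
        raise ValueError("need patients both with and without the outcome")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc(scores, outcomes, n_boot=100, seed=0):
    """Median and interquartile range of the AUC over bootstrap resamples."""
    rng = random.Random(seed)
    idx = range(len(scores))
    aucs = []
    while len(aucs) < n_boot:
        sample = [rng.choice(idx) for _ in idx]  # resample patients with replacement
        ys = [outcomes[i] for i in sample]
        if all(ys) or not any(ys):
            continue  # redraw if only one outcome class was sampled
        aucs.append(auc([scores[i] for i in sample], ys))
    q1, _, q3 = quantiles(aucs, n=4, method="inclusive")
    return median(aucs), (q1, q3)

# Hypothetical toy data: score value and whether the outcome occurred
scores = [1, 2, 3, 4, 5, 6]
outcomes = [0, 0, 1, 0, 1, 1]
point_estimate = auc(scores, outcomes)  # 8 of 9 pairs are correctly ordered
boot_median, (boot_q1, boot_q3) = bootstrap_auc(scores, outcomes)
print(round(point_estimate, 3))
```

A score that assigns the same value to everyone yields 0.5 under this definition, matching the coin-toss interpretation.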
Statistical analyses were performed using R version 3.4.3 (R Foundation for Statistical Computing) and the packages “survival,” “cmprsk,” “prodlim,” “pec,” “rms,” and “ggplot2.”
Results
Population characteristics
Population characteristics are provided in Table 1. A total of 528 patients were included. The median age was 55 years (IQR, 40-64 years). Patients were treated for a variety of malignant and benign conditions. Acute myeloid leukemia was most prevalent, at 44% of patients, followed by myelodysplastic syndrome (14%) and non-Hodgkin lymphoma (12%). Fifty-six percent of patients had intermediate-risk disease per the rDRI at time of transplantation. The majority of patients received pretransplantation conditioning with a myeloablative regimen (74%), and 10% of patients were conditioned with a regimen including total body irradiation at any dose. HLA-matched sibling donors were used for 46% of patients treated. The median follow-up was 2.5 years (IQR, 1.7-3.9 years). Supplemental Table 2 compares our cohort to the derivation cohorts of each of the original scores.
Characteristic | n (%) or median [IQR] | Missing (%) | Included in* |
---|---|---|---|
Age, y | 55 [40, 64] | 0 (0) | C-A, EBMT, PAM, rPAM, C-E |
Days from diagnosis to HSCT | 189 [104, 596] | 17 (3.2) | EBMT, C-E |
Diagnosis | | 0 (0) | EBMT, PAM, rPAM, C-E, rDRI |
AML | 233 (44.1) | | |
ALL | 52 (9.8) | | |
CLL | 11 (2.1) | | |
CML | 14 (2.7) | | |
HL | 19 (3.6) | | |
MDS | 78 (14.8) | | |
MM | 20 (3.8) | | |
MF | 20 (3.8) | | |
NHL | 65 (12.3) | | |
AA/nonmalignant | 16 (3.0) | | |
Serum ALT, U/L | 32 [18, 53] | 0 (0) | HCT-CI, C-A, PAM, C-E |
Serum creatinine, mg/dL | 0.85 [0.73, 1.03] | 0 (0) | HCT-CI, C-A, PAM, EASIx, C-E |
Serum LDH, U/L | 208 [173, 276] | 0 (0) | EASIx |
Platelets, ×10⁹/L | 132 [64, 190] | 0 (0) | EASIx |
FEV1, % expected | 95 [84, 104] | 0 (0) | HCT-CI, C-A, PAM, rPAM, C-E |
DLCo, adjusted for Hb | 92.9 [80.4, 109.1] | 0 (0) | HCT-CI, C-A, C-E |
Disease risk | | 0 (0) | rPAM,† rDRI |
Low risk | 29 (5.5) | | |
Intermediate risk | 298 (56.4) | | |
High risk | 144 (27.3) | | |
Very high risk | 39 (7.4) | | |
Not applicable | 18 (3.4) | | |
Regimen intensity | | 0 (0) | |
Myeloablative | 391 (74.1) | | |
Reduced intensity | 137 (25.9) | | |
TBI-containing regimen | 54 (10.2) | 0 (0) | PAM |
Donor | | 0 (0) | EBMT, PAM, rPAM, C-E |
Matched sibling | 241 (45.6) | | |
Matched unrelated, 10/10 | 207 (39.2) | | |
Mismatched unrelated, 9/10 | 80 (15.2) | | |
Female to male | 126 (23.9) | 0 (0) | EBMT |
CMV serostatus pair, % | | 38 (7.2) | rPAM |
Donor − Recipient − | 52 (9.8) | | |
Donor − Recipient + | 96 (18.2) | | |
Donor + Recipient − | 30 (5.7) | | |
Donor + Recipient + | 312 (59.1) | | |
AA, aplastic anemia; ALL, acute lymphoid leukemia; ALT, alanine aminotransferase; AML, acute myeloid leukemia; C-A, comorbidity-age; C-E, comorbidity-EBMT; CLL, chronic lymphocytic leukemia; CML, chronic myelogenous leukemia; DLCo, diffusing capacity for carbon monoxide; FEV1, forced expiratory volume in 1 second; HL, Hodgkin lymphoma; LDH, lactate dehydrogenase; MDS, myelodysplastic syndrome; MF, myelofibrosis; MM, multiple myeloma; NHL, non-Hodgkin lymphoma; TBI, total body irradiation.
*Additional comorbidity variables are included in the HCT-CI and C-A scores.
†As defined by the rDRI.11 Alternative disease-staging schemes are included in the EBMT and PAM scores.
Score distributions
Each score was calculated for all patients in the cohort, with the exception of the rPAM, which could not be calculated for 38 patients (7%) due to missing donor-recipient CMV serostatus, and the rDRI, which was inapplicable to 19 patients (4%). The distribution for each score is shown in Figure 1 and supplemental Figure 1. Scores were positively correlated (supplemental Figure 2). The Pearson correlation between scores was generally below 0.50, except for scores whose components substantially overlap, such as the Comorbidity-EBMT, Comorbidity-Age, and HCT-CI; the rPAM, which includes the rDRI; or the EBMT and PAM, which share a definition of disease risk. Scores were nonnormally distributed (P < .001 in all cases), with more patients having low (favorable) than high (adverse) values. EASIx was notable for its distant outliers, with 75% of values between 0 and 3.76, and the remaining quartile extending to 212 (Figure 1F inset).
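The pairwise correlation analysis can be sketched as follows: a minimal Python illustration of the Pearson product-moment coefficient applied to every pair of scores. The score values below are hypothetical and chosen only to show overlapping scores correlating strongly and unrelated scores weakly; they are not study data.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical score values for five patients (not study data)
scores = {
    "HCT-CI": [0, 1, 3, 4, 2],
    "Comorbidity-Age": [0, 2, 4, 5, 2],
    "EASIx": [0.7, 1.1, 3.9, 0.9, 2.5],
}
pairwise = {
    (a, b): pearson(scores[a], scores[b]) for a, b in combinations(scores, 2)
}
for pair, r in pairwise.items():
    print(pair, round(r, 2))
```

Because the scores are nonnormally distributed and EASIx has distant outliers, a rank-based coefficient would be a reasonable sensitivity check, although the study reports Pearson coefficients.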
Outcomes
The highest-risk stratum was associated with increased risk for overall mortality in the Comorbidity-Age, Comorbidity-EBMT, EBMT, PAM, rPAM, and EASIx scores and the rDRI in the univariable setting (Table 2; supplemental Table 3; Figure 2). However, a monotonic increase (ie, increasing risk with each score stratum) was best observed in rPAM (hazard ratio, 1.5, 2.5, 3.3), corresponding with decreasing OS probability. Similar results were observed for NRM, with hazard ratios ≥ 3.0 in the Comorbidity-Age, PAM, rPAM, and EASIx for the highest-risk strata. Relapse was not predicted by the HCT-CI or EASIx scores. A multivariable analysis, adjusted for age, donor type, conditioning intensity, disease status, and year, was conducted only for HCT-CI and EASIx scores, as these indices do not include disease- and transplant-related features (supplemental Table 4); the highest-risk stratum of EASIx remained an independent predictor of overall mortality and NRM. A similar multivariable analysis for the rDRI, incorporating age, donor type, conditioning intensity, and year of transplantation, demonstrated that the rDRI remained a predictor of overall mortality and NRM as well as relapse. Additionally, higher rDRI levels were associated with increasing risk in the multivariable models for both HCT-CI and EASIx, further supporting these indices’ potentially additive role.
Score | Level | 2-y OS, % (range) | Log-rank P | 2-y NRM, % (range) | Gray P | 2-y RI, % (range) | Gray P |
---|---|---|---|---|---|---|---|
HCT-CI | 0 | 50.1 (40.7-61.8) | | 15.3 (9.5-24.6) | | 34.4 (26.1-45.3) | |
HCT-CI | 1-2 | 60.2 (53.4-67.9) | | 18.6 (13.8-25.0) | | 24.8 (19.2-31.9) | |
HCT-CI | 3+ | 46.5 (40.2-53.7) | .037 | 24.7 (19.6-31.1) | .203 | 30.7 (25.2-37.3) | .174 |
Comorbidity-Age | 0 | 69.6 (55.3-87.6) | | 11.1 (4.4-28.0) | | 22.4 (11.5-43.7) | |
Comorbidity-Age | 1-2 | 55.8 (48.8-63.7) | | 17.0 (12.3-23.3) | | 28.3 (22.6-35.5) | |
Comorbidity-Age | 3-4 | 55.7 (48.5-64.0) | | 18.9 (13.7-26.0) | | 27.6 (21.5-35.3) | |
Comorbidity-Age | 5+ | 36.6 (28.6-46.8) | <.001 | 31.3 (24.0-40.9) | .006 | 34.8 (27.3-44.3) | .389 |
Comorbidity-EBMT | 0/<4 | 49.9 (38.0-65.7) | | 14.6 (8.0-26.7) | | 34.3 (23.9-49.3) | |
Comorbidity-EBMT | 0/≥4 | 50.1 (36.1-69.7) | | 16.4 (7.7-34.9) | | 34.6 (22.6-52.9) | |
Comorbidity-EBMT | I-II/<4 | 66.4 (57.4-76.8) | | 16.1 (10.1-25.5) | | 23.0 (15.7-33.6) | |
Comorbidity-EBMT | I-II/≥4 | 53.9 (44.3-65.6) | | 21.1 (14.3-31.1) | | 26.5 (18.9-37.1) | |
Comorbidity-EBMT | III+/<4 | 56.3 (47.1-67.3) | | 16.3 (10.4-25.7) | | 28.8 (21.3-38.9) | |
Comorbidity-EBMT | III+/≥4 | 38.1 (30.2-48.0) | .001 | 31.7 (24.4-41.1) | .019 | 32.3 (25.0-41.7) | .531 |
EBMT | 0-2 | 63.0 (54.7-72.6) | | 15.9 (10.5-24.1) | | 25.3 (18.5-34.5) | |
EBMT | 3 | 54.4 (46.3-64.0) | | 16.0 (10.7-23.8) | | 30.1 (18.5-34.5) | |
EBMT | 4 | 46.5 (38.0-56.9) | | 24.3 (17.5-33.6) | | 30.1 (22.9-39.6) | |
EBMT | 5 | 46.5 (36.5-59.2) | | 22.0 (14.6-33.1) | | 34.6 (25.7-46.7) | |
EBMT | 6-7 | 43.9 (32.6-59.2) | .007 | 32.5 (22.4-47.1) | .046 | 25.9 (16.7-40.1) | .346 |
rDRI | Low | 72.7 (57.1-92.6) | | 15.9 (6.4-39.5) | | 15.0 (6.0-37.4) | |
rDRI | Intermediate | 61.4 (55.8-67.5) | | 16.2 (12.4-21.2) | | 23.6 (19.1-29.1) | |
rDRI | High | 32.1 (24.9-41.3) | | 26.5 (20.1-35.0) | | 44.9 (37.4-54.0) | |
rDRI | Very high | 31.5 (19.5-50.7) | <.001 | 33.5 (21.5-52.3) | .024 | 35.9 (23.6-54.6) | <.001 |
PAM | <15 | 64.6 (56.8-73.5) | | 12.5 (8.0-19.5) | | 24.7 (18.3-33.3) | |
PAM | 15-20 | 59.7 (51.4-69.3) | | 16.6 (11.0-25.0) | | 24.4 (18.0-33.1) | |
PAM | 20-25 | 47.2 (38.8-57.6) | | 22.4 (16.0-31.2) | | 35.6 (28.0-45.3) | |
PAM | >25 | 35.0 (27.5-44.7) | <.001 | 32.4 (25.2-41.7) | .001 | 33.2 (25.9-42.5) | .036 |
rPAM | <12.3 | 71.2 (63.6-79.7) | | 12.1 (7.5-19.5) | | 17.7 (12.2-25.7) | |
rPAM | 12.3-16.5 | 61.0 (52.2-71.2) | | 15.5 (10.0-24.0) | | 24.0 (17.2-33.6) | |
rPAM | 16.6-21.9 | 42.5 (34.0-53.0) | | 23.9 (17.3-33.1) | | 36.0 (28.2-45.9) | |
rPAM | >21.9 | 35.7 (27.8-46.0) | <.001 | 34.4 (26.8-44.2) | <.001 | 33.7 (26.1-43.4) | .003 |
EASIx | <0.89 | 66.1 (58.2-75.1) | | 11.1 (6.7-18.2) | | 26 (19.3-34.9) | |
EASIx | 0.89-1.40 | 54.8 (46.4-64.7) | | 11.3 (6.9-18.6) | | 35.3 (27.8-44.8) | |
EASIx | 1.40-3.76 | 49.0 (40.9-58.7) | | 27.8 (21.0-36.9) | | 26.2 (19.7-34.9) | |
EASIx | >3.76 | 38.8 (30.9-48.6) | <.001 | 32.0 (24.8-41.2) | <.001 | 29.6 (22.6-38.6) | .377 |
Discrimination and calibration
Discrimination for OS, NRM, and relapse at 100 days, 1 year, 2 years, and 3 years posttransplantation is described in Figure 3, with further validation using 100 bootstraps presented in supplemental Table 5. For OS, AUCs ranged from 0.55 to 0.67. Values were highest for the PAM and rPAM scores across all time points, ranging from 0.62 to 0.66 for PAM and 0.63 to 0.67 for rPAM. The EASIx score showed comparable discrimination at day 100 (0.64), subsequently decreasing to as low as 0.58 at 3 years. The EBMT, HCT-CI, and Comorbidity-Age scores had AUCs ranging from 0.56 to 0.60 across all time points. Through the 2-year time point, PAM, rPAM, and EASIx had closely aligned AUCs for NRM (ranging from 0.63 to 0.67), though EASIx decreased at 3 years while PAM and rPAM remained stable. AUCs were lower overall for the prediction of relapse, with the highest AUC associated with the rPAM score and rDRI at 100 days and 1 year (ranging from 0.63 to 0.65); other AUCs for relapse were mostly in the 0.5 to 0.6 range. EASIx had the lowest AUCs for relapse at all time points. All scores were well calibrated for OS (supplemental Figure 3A-C).
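The graphical calibration check by score quartile can be approximated numerically, as in the sketch below. It assumes, for simplicity, that each patient has a predicted risk and a known binary outcome at the time point (no censoring); the function name and toy values are hypothetical and do not reproduce the supplemental figures.

```python
def calibration_by_quartile(predicted, observed):
    """Return (mean predicted risk, observed event fraction) for each
    quartile of predicted risk, lowest quartile first."""
    pairs = sorted(zip(predicted, observed))
    n = len(pairs)
    rows = []
    for q in range(4):
        group = pairs[q * n // 4:(q + 1) * n // 4]
        if not group:
            continue  # fewer than 4 patients: some quartiles are empty
        mean_pred = sum(p for p, _ in group) / len(group)
        event_frac = sum(o for _, o in group) / len(group)
        rows.append((mean_pred, event_frac))
    return rows

# Hypothetical predicted risks and outcomes (1 = event occurred)
predicted = [0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.5, 0.5]
observed = [0, 0, 0, 1, 0, 1, 1, 0]
rows = calibration_by_quartile(predicted, observed)
for mean_pred, event_frac in rows:
    print(f"predicted {mean_pred:.2f}  observed {event_frac:.2f}")
```

In a well-calibrated model the two columns track each other; a quartile whose observed fraction sits well above its mean predicted risk indicates that risk is underestimated in that group.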
Subpopulations
Score performance was further studied by age (<55 years, ≥55 years), donor type (MSD, MUD), and conditioning intensity (myeloablative conditioning [MAC], reduced-intensity conditioning [RIC]). PAM and rPAM had higher AUCs in the younger age group (0.65 vs 0.59, 0.69 vs 0.61, respectively; supplemental Figure 4). In contrast, EASIx performed better among older patients (0.61 vs 0.56). Most prognostic indices had similar discrimination irrespective of donor, with 2 exceptions. The PAM score demonstrated greater discrimination in the MSD setting (MSD, 0.68 vs MUD, 0.59), whereas the EASIx score had greater discrimination in the MUD setting (MSD, 0.56 vs MUD, 0.62). Higher AUCs were observed for the myeloablative subgroup for the rPAM, HCT-CI, Comorbidity-Age, and EASIx scores (MAC, 0.70 vs RIC, 0.61; 0.57 vs 0.51; 0.59 vs 0.53; and 0.63 vs 0.53, respectively).
Additionally, the acute leukemias, representing the most common indication for allogeneic transplantation, were studied separately. The rDRI, rPAM, EBMT, PAM, Comorbidity-EBMT, and EASIx scores all had AUCs in the low 0.6 range for 2-year OS. An additional score, the AL-EBMT, which is applicable only to the acute leukemias, was also included and had a similar AUC (0.63).
Discussion
In this retrospective analysis, we compared 8 prognostic models in a cohort of allogeneic transplant recipients. Score prediction performance, in terms of risk stratification and discrimination, varied considerably, both across outcomes and subgroups. The majority of models, most notably rPAM, successfully grouped patients into lower- and higher-risk strata, supporting their use for risk classification. However, accurate individualized prediction remains suboptimal. Similar to previous studies, the best score performances approached an AUC of 0.70 on a scale of 0.50 to 1.00, necessitating caution when making individual clinical decisions based on these tools. Score performance varied based on the outcome being measured, an effect observed most strikingly in the EASIx score, which was among the strongest predictors of NRM but had little or no information regarding relapse. Intriguingly, rPAM was roughly consistent in its prognostic capacity across all 3 outcomes studied.
Death following transplantation is typically understood as the tension between 2 competing events: transplantation-related mortality and relapse. Naturally, pretransplantation disease features tend to predict relapse, whereas patient-specific characteristics are more indicative of transplantation-related mortality. Prognostic models in HSCT could be viewed as global scores (eg, rPAM, PAM, EBMT, and Comorbidity-EBMT), which incorporate variables from several domains to provide an estimate of the expected OS, vs domain-specific scores that include specific patient-related or disease-related features (eg, HCT-CI, Comorbidity-Age, EASIx, and rDRI). Depending on their components, the latter group may be informative of the risk of relapse or NRM. Indeed, rDRI was predictive of relapse, whereas EASIx was among the top predictors of NRM (Figure 3B). A clinician may use information from domain-based scores, but it is not clear how these should be combined with respect to balancing the benefit-risk ratio of transplantation. Therefore, a point could be made in favor of global scores integrating components predictive of relapse and NRM. The poor correlation between the EASIx or HCT-CI scores and all other scores (supplemental Figure 2) suggests that they may be additive. However, combination of the HCT-CI and EBMT (Comorbidity-EBMT) did not result in a meaningful improvement of prediction. Each of the global scores incorporates a variable for the risk inherent to the diagnosis and stage. The higher discrimination with rPAM may be attributed to the incorporation of the rDRI,11,18 which is a more contemporary disease-risk scheme than the EBMT score’s embedded disease-risk criteria. Overall, the disease-risk variable is perhaps the single greatest predictor of transplantation outcomes.15,18
Determining the generalizability of prognostic scores and avoiding overoptimistic performance assessment requires external validation. Validation studies of each of these scores have been published19-23; however, direct comparisons on the same population have rarely been performed, and most include only 2 or 3 scores (HCT-CI and EBMT or PAM).6,14,24-27 Furthermore, methodologies have varied; Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) guidelines,17,28 which outline best practices for model development and validation, are inconsistently followed. In accordance with the TRIPOD recommendations, we report both the calibration and discrimination of the models. Overall, the models studied are well calibrated, indicating that predicted outcomes are aligned with observations. However, in the HCT-CI score, there is poor calibration for lower scores due to overestimation of survival for patients in the lowest score quartile. This is corrected in the Comorbidity-Age, implying that accounting for advanced age reduces overoptimism in the lower HCT-CI risk groups.
Score performances varied across subpopulations (supplemental Figure 4). In the rPAM, HCT-CI, and Comorbidity-Age scores, discriminative capacity was higher in the MAC setting compared with RIC. This may reflect a predominance of MAC among development cohorts (supplemental Table 2). EASIx also demonstrated higher AUCs in the myeloablative cohort; although EASIx was developed with a large number of RIC patients, it could be argued that MAC patients are more susceptible to the endothelial dysfunction that the score was initially developed to predict. Because acute leukemia is the leading indication for allogeneic transplantation, we performed a subanalysis restricted to this population (supplemental Figure 4D). The AL-EBMT score, an acute leukemia-specific score that was the first machine-learning–based predictive model developed in allogeneic transplantation, was also incorporated.15 The consistency of AUCs (range, 0.60-0.64) across all of the scores (except the comorbidity indices, which were lower), irrespective of different modeling approaches, suggests that databases comprising traditional parameters have been exhausted. Improvement in prediction will likely require the incorporation of novel biomarkers.
Models relying on robust and objective biomarkers may improve our ability to predict. We have previously shown that pretransplantation hypoalbuminemia and renal function abnormalities are among the strongest risk factors for poor outcomes in HSCT recipients.29 EASIx stands out for incorporating only laboratory-based markers, while performing similarly to other clinically oriented scores (rPAM, PAM) in the initial time points. Furthermore, when EASIx is studied in a multivariable analysis adjusting for key clinical features, the highest score stratum maintains a strong association with increased mortality. Consistent with its authors’ underlying hypothesis, the score’s accuracy is driven by the ability to predict NRM, a toxicity-based measure, whereas relapse is not predicted. The continuing identification of genetic and microbiome markers of both relapse and nonrelapse risk should motivate the field to pursue further biologically driven risk-prediction schemes.30-34
Differences in the scores’ predictive performance between our cohort and the derivation cohorts may stem from differences in the populations (supplemental Table 2). The current cohort represents a more recent transplant era compared with the original studies, which may partially account for discrepancy in the models’ performance. Also, previous studies have suggested that the utility of risk indices may be center-specific.35 Although derived from a single-center cohort, these results recapitulate findings by independent validation studies. Alternative donors are not represented in our cohort and remain underrepresented across all validation studies. The generalizability of transplantation prognostic indices to the alternative donor setting has been studied in small cohorts with varying performance.19,21 The development of donor-specific systems will likely contribute to more accurate predictions.36 One must keep in mind that scores are limited to patients who received transplants and do not capture alternative treatments; therefore, they cannot support a fully informed decision that weighs all potential therapeutic paths. In clinical practice, the scores’ greatest utility may be in identifying that subset of patients who are least likely to benefit from HSCT. In all scores, the highest stratum was associated with substantially increased risk of poor outcome. Given the limited correlation between scores driven by meaningfully different feature sets, each system may identify a different subset of these highest-risk patients. This approach simulates, and may augment, the clinical intuition that integrates a patient’s physiologic status (age, comorbidities) and procedural characteristics (donor, diagnosis, and stage). The highest-risk stratum represents a population with extremely limited alternative treatment options; however, these patients may be ideal candidates for clinical trials.
As novel therapeutic approaches emerge in hemato-oncology, the risk-benefit analysis for allogeneic transplantation becomes ever more important. In this retrospective comparison of the leading prognostic indices in HSCT, we show that most models can be used to stratify patients, but not to make individualized predictions. Barriers to improvement include, first and foremost, quantity and quality of source data as well as selection biases in development cohorts. Oversimplifications of the relationship between predictor and response, such as the use of categorized in place of continuous measures, and parametric assumptions on data behavior, may lead to the loss of prognostic information.37 Also, aside from an inherent stochastic component, the risk of detrimental outcomes following transplantation evolves over time, and patients remain susceptible to events that cannot be anticipated (eg, infection, graft-versus-host disease, depression).16 The advent of electronic medical records and large international registries now permits a more granular exploration of transplantation outcomes. Personalization of the transplantation procedure may be made possible by developing new and specific prediction schemes based on large, homogeneous cohorts, while integrating novel modeling techniques.15,36,38 Furthermore, big data analysis may identify modifiable features that are predictive of clinical paths and therefore could be acted upon. We have previously shown that the risk of conditioning toxicity is dependent on the patient’s individual comorbidities rather than their cumulative burden, suggesting the potential for both the prediction and treatment optimization that such a granular approach allows.39 A new generation of prediction models, integrating the newfound wealth of data and biological knowledge, is needed to truly inform individual decision-making in allogeneic transplantation.
The full-text version of this article contains a data supplement.
Acknowledgments
This work was supported by The Varda and Boaz Dotan Research Center in Hemato-Oncology affiliated with the Cancer Biology Research Center of Tel Aviv University and The Shalvi Foundation for the Support of Medical Research.
Authorship
Contribution: J.A.F., A. Shouval, A.N., and R.S. designed the study; J.A.F. and R.S. wrote the initial draft of the manuscript; and all authors were involved in the collection and interpretation of the data, edited the initial draft of the manuscript, and agreed to the final manuscript.
Conflict-of-interest disclosure: The authors declare no competing financial interests.
Correspondence: Joshua A. Fein, Sackler School of Medicine, Tel Aviv University, Ramat Aviv 69978, Israel; e-mail: joshuafein@gmail.com.
Author notes
R.S., J.A.F., and A. Shouval contributed equally to this study.
Individual patient data will not be shared per institutional review board requirements.