Key Points
Quantitative PET radiomics and clinical features can be used to build a strong prognostic model for 3-year PFS in relapsed/refractory cHL.
We identified a subgroup of high-risk patients with R/R cHL with inferior PFS and overall survival for whom novel therapies should be considered.
Abstract
Investigating prognostic factors in patients with relapsed or primary refractory classical Hodgkin lymphoma (R/R cHL) is essential to optimize risk-adapted treatment strategies. We built a prognostic model using baseline quantitative 18F-fluorodeoxyglucose positron emission tomography (PET) radiomics features and clinical characteristics to predict the progression-free survival (PFS) among patients with R/R cHL treated with salvage chemotherapy followed by autologous stem cell transplantation. Metabolic tumor volume and several novel radiomics dissemination features, representing interlesional differences in distance, volume, and standard uptake value, were extracted from the baseline PET. Machine learning using backward selection and logistic regression was applied to develop and train the model on a total of 113 patients from 2 clinical trials. The model was validated on an independent external cohort of 69 patients. In addition, we validated 4 different PET segmentation methods to calculate radiomics features. We identified a subset of patients at high risk for progression with significantly inferior 3-year PFS outcomes of 38.1% vs 88.4% for patients in the low-risk group in the training cohort (P < .001) and 38.5% vs 75.0% in the validation cohort (P = .015). The overall survival was also significantly better in the low-risk group (P = .022 and P < .001). We provide a formula to calculate a risk score for individual patients based on the model. In conclusion, we developed a prognostic model for PFS combining radiomics and clinical features in a large cohort of patients with R/R cHL. This model calculates a PET-based risk profile and can be applied to develop risk-stratified treatment strategies for patients with R/R cHL. These trials were registered at www.clinicaltrials.gov as #NCT02280993, #NCT00255723, and #NCT01508312.
Introduction
Classical Hodgkin lymphoma (cHL) mainly affects young adults.1 Treatment consists of chemotherapy and radiotherapy and is successful in most cases.2 However, ∼10% to 20% of patients still relapse or are primary refractory, of whom ∼50% to 60% can be cured with salvage chemotherapy and autologous stem cell transplantation (ASCT). The remaining 40% to 50% generally have a very poor prognosis.3,4 Risk profiling at baseline before starting second-line treatment could be used to identify patients with a high risk of progression, for whom novel (immune) therapies can be considered before the start of salvage chemotherapy, instead of adapting treatment to response assessment after reinduction therapy. 18F-Fluorodeoxyglucose (FDG) positron emission tomography (PET) computed tomography (CT)–adapted treatment has improved outcomes for patients with newly diagnosed cHL.5-7 Although the prognostic value of a complete metabolic response (CMR) before ASCT in patients with relapsed/refractory (R/R) cHL is well known, there is currently no PET-adapted treatment strategy that is widely applied in the salvage treatment setting.8-10
Metabolic tumor volume (MTV) is increasingly studied in cHL and has shown moderate prognostic value as a single biomarker.9,11-16 In most studies, a different cutoff for MTV is used without validation of results, which impedes the use of MTV as a prognostic marker. The Dmax, ie, the maximum distance between 2 lesions, provides another quantitative PET feature, which has shown prognostic value for newly diagnosed cHL.17,18 Radiomics is an emerging field of research that uses high-throughput imaging-based data to extract quantitative image features from a predefined volume of interest (VOI), such as FDG-avid tumors on a PET. Differences in FDG intensity of the VOI (tumor), shape, volume, localization, texture, and intratumor and intertumor heterogeneity can be investigated and reinforced with available genomic and clinical data to develop prognostic models.19-22
Only a few studies have assessed radiomics in cHL, and most prognostic models lack validation in an independent cohort.18,23,24 A prognostic radiomics model based on texture features in newly diagnosed cHL showed high prognostic value for predicting refractory disease, but results were not validated in an independent cohort.23 In addition, texture features, which are calculations based on individual voxels, are susceptible to technical variations, especially in small lesions.25 Many patients present with small lesions in the R/R setting because of early detection during follow-up after first-line treatment.10 Therefore, radiomics dissemination and interlesion heterogeneity parameters (eg, the spread or the difference in distance, volume, and FDG uptake between lesions), which are less susceptible to technical variations, could be more suitable for use in disseminated diseases with smaller lesions such as lymphoma.26 Most other prognostic models that have been developed to predict outcomes in the R/R setting, for example, gene expression–based models,27 have not yet been implemented in a prospective clinical trial or clinical practice, which can possibly be explained by high costs and time-consuming analyses. Because PETs are already used in clinical practice, information obtained through radiomics may help predict outcomes among patients with cHL more accurately and can be implemented in clinical practice to guide treatment decisions, which, in turn, may improve clinical outcome.20-22
Materials and methods
Study population
Patients treated within the following 3 clinical trials were included: (1) Kersten et al,28 who investigated a combination of brentuximab vedotin (BV) and dexamethasone, high-dose cytarabine and cisplatin (DHAP) followed by ASCT; (2) Moskowitz et al,9,29 who investigated sequential BV and ifosfamide, carboplatin, and etoposide (ICE), followed by ASCT; and (3) Moskowitz et al,30 who investigated ICE and optional sequential gemcitabine, vinorelbine, and doxorubicin (GVD) for patients with no CMR, followed by ASCT. A complete overview of treatment regimens is provided in supplemental Table 1. All patients were transplant-eligible and had biopsy-proven cHL, and the PET-CT was performed at baseline, that is, before the start of salvage therapy. Patients were excluded if no PET was available or if the follow-up time was <2 years. An overview of reasons for patient exclusion is provided in supplemental Table 2.
All patients provided written informed consent for participation in the clinical trials (NCT02280993, NCT00255723, and NCT01508312), of which the study protocols were approved by institutional review boards and ethics committees of the centers that conducted the trials. For secondary use of data for this analysis, a waiver was obtained from the ethics committee of Amsterdam University Medical Centers, The Netherlands and the Memorial Sloan Kettering Cancer Center, NY.
18F-FDG PET-CTs and quality control
The PET-CT systems used to perform the scans were accredited by the European Association of Nuclear Medicine Research Ltd. (EARL, Europe) or the American College of Radiology (ACR, United States).31 PET-CTs were deidentified at the participating centers and centrally collected. Inclusion criteria were (1) plasma glucose < 11 mmol/L; (2) reconstruction of attenuation-corrected PET according to guidelines described by EARL or ACR; (3) total image activity (in megabecquerel) between 50% and 80% of the total injected FDG activity or a liver standard uptake value (SUV) mean (SUVmean) between 1.3 and 3.0; and (4) availability of essential PET acquisition data and clinical data.31,32
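The numeric parts of these inclusion criteria can be written as a small check. The following is only a sketch: the function and parameter names are hypothetical, the bounds are treated as inclusive (the text does not say), and criteria (2) and (4) are not modeled because they are not numeric.

```python
def passes_pet_qc(glucose_mmol_l, image_activity_mbq, injected_activity_mbq, liver_suvmean):
    """Return True if a scan meets numeric criteria (1) and (3) described in the text."""
    # Criterion (1): plasma glucose below 11 mmol/L.
    glucose_ok = glucose_mmol_l < 11.0
    # Criterion (3), first option: image activity is 50% to 80% of the injected activity.
    activity_fraction = image_activity_mbq / injected_activity_mbq
    activity_ok = 0.50 <= activity_fraction <= 0.80
    # Criterion (3), second option: liver SUVmean between 1.3 and 3.0.
    liver_ok = 1.3 <= liver_suvmean <= 3.0
    return glucose_ok and (activity_ok or liver_ok)
```

A scan with low recovered activity can still pass through the liver SUVmean option, which mirrors the "or" in criterion (3).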
PET segmentation and radiomics feature extraction
Attenuation-corrected PETs were analyzed using the ACCURATE tool, as described before.26,33 We published earlier that segmentation using a fixed threshold of an SUV of 4.0 (SUV4.0 method) is most suitable for application in the clinical practice for patients with cHL.26 However, because this method frequently does not include small lesions with low FDG uptake (SUV < 4.0), we also analyzed scans with a threshold of an SUV of 2.5 (SUV2.5 method) and a combination method (“combimethod”) in which segmentation with an SUV of 4.0 is complemented with a threshold of an SUV of 2.5 for missing lesions with low uptake (ie, SUV < 4.0). Additionally, we analyzed all scans with a relative threshold of 41% of the SUVmax (41max method) for comparison with those of other studies, because this method has also been used frequently in literature.9,11
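To make the difference between the four thresholds concrete, here is a simplified sketch that computes a volume from voxel SUVs. It is not the ACCURATE tool's algorithm: lesions are assumed to be given as predefined lists of voxel SUVs, voxels at exactly the threshold are counted, the 41% threshold is applied per lesion, and all function names are hypothetical.

```python
def mtv_fixed_threshold(suv_voxels, voxel_volume_ml, threshold):
    """Volume (mL) of voxels with SUV at or above a fixed threshold (eg, 4.0 or 2.5)."""
    return sum(1 for suv in suv_voxels if suv >= threshold) * voxel_volume_ml


def mtv_relative_threshold(suv_voxels, voxel_volume_ml, fraction=0.41):
    """Volume (mL) of voxels at or above a fraction of this lesion's SUVmax."""
    cutoff = fraction * max(suv_voxels)
    return sum(1 for suv in suv_voxels if suv >= cutoff) * voxel_volume_ml


def mtv_combi(lesions, voxel_volume_ml):
    """Combimethod sketch: SUV 4.0 per lesion, with SUV 2.5 only for lesions missed at 4.0."""
    total = 0.0
    for voxels in lesions:
        volume = mtv_fixed_threshold(voxels, voxel_volume_ml, 4.0)
        if volume == 0:
            # Lesion has no voxel reaching SUV 4.0, so fall back to the lower threshold.
            volume = mtv_fixed_threshold(voxels, voxel_volume_ml, 2.5)
        total += volume
    return total


# Toy data: one FDG-avid lesion and one low-uptake lesion (SUVmax < 4.0).
lesion_a = [1.0, 2.6, 3.0, 4.2, 5.0, 8.0]
lesion_b = [1.5, 2.7, 3.1, 3.9]
```

With these toy values, the SUV4.0 method misses `lesion_b` entirely, whereas the combimethod recovers it at the lower threshold without enlarging `lesion_a`.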
Only focal extranodal and splenic lesions were included in the VOI. A global increase in the FDG uptake of the spleen or bone marrow was not included in the VOI. Delineations were performed by J.D. under supervision of a nuclear medicine physician (G.J.C.Z. or H.S.). RaCat software was used to extract 18 patient-level dissemination, standard intensity–based, and volume-based features such as MTV, SUV parameters, and total lesion glycolysis (ie, SUVmean multiplied by MTV) from the complete VOI at patient level.34 An overview of all features and their definitions is provided in supplemental Table 3, and examples are given in Figure 1. Dissemination features included several novel features addressing interlesional heterogeneity based on distance, volume, and intensity. Because of the multicenter aspect of this study and the use of different PET systems, we only used robust radiomics features that were not susceptible to technical variations in PET acquisition, such as dissemination features and SUVpeak (ie, the average SUV of the 1 mL with the highest FDG uptake within the VOI) instead of SUVmax (which represents only the SUV of the highest single voxel and is therefore susceptible to image noise). Additionally, because the SUVmean of the liver was used as a standard quality parameter to compare PETs and was also the reference for a Deauville score of 3, we normalized the SUVmean and SUVpeak for the liver SUVmean and used the tumor-to-liver ratio (TLR).32,35-37 The liver SUVmean was estimated on a 3-mL sphere in the right upper lobe of the liver.
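As an illustration of the patient-level features, the sketch below computes Dmax (maximum distance between 2 lesions), Spread (sum of the distances between all lesions), VolSpread (sum of differences in volume between all lesions), the tumor-to-liver ratio, and total lesion glycolysis. It is not RaCat's implementation: measuring distance between lesion centroids, using Euclidean distance, and counting each lesion pair once are assumptions, and the data structure and function names are hypothetical. The exact definitions are those in supplemental Table 3.

```python
import math
from itertools import combinations


def dissemination_features(lesions):
    """lesions: list of dicts with 'centroid' (x, y, z) and 'volume' in mL."""
    pairs = list(combinations(lesions, 2))  # each unordered lesion pair once
    distances = [math.dist(a["centroid"], b["centroid"]) for a, b in pairs]
    return {
        "n_lesions": len(lesions),
        "dmax": max(distances, default=0.0),  # largest inter-lesion distance
        "spread": sum(distances),             # sum of all inter-lesion distances
        "vol_spread": sum(abs(a["volume"] - b["volume"]) for a, b in pairs),
    }


def tumor_to_liver_ratio(tumor_suv, liver_suvmean):
    """TLR: tumor SUVmean (or SUVpeak) normalized to the liver SUVmean."""
    return tumor_suv / liver_suvmean


def total_lesion_glycolysis(suvmean, mtv_ml):
    """TLG: SUVmean multiplied by MTV."""
    return suvmean * mtv_ml


# Toy patient with three lesions; centroid coordinates are in arbitrary length units.
example_lesions = [
    {"centroid": (0.0, 0.0, 0.0), "volume": 10.0},
    {"centroid": (30.0, 40.0, 0.0), "volume": 4.0},
    {"centroid": (0.0, 0.0, 120.0), "volume": 1.0},
]
features = dissemination_features(example_lesions)
```

A patient with a single lesion gets 0 for all three pairwise features, which is one reason a log transform of such features needs care with zero values.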
End point
The primary end point was to develop a prognostic model for 3-year progression-free survival (PFS) using clinical and radiomics features measured at baseline. PFS was defined as the time from enrollment until progression or death from any cause (binary outcomes: 1 = progression or death and 0 = no event at 3 years). The secondary end point was the 3-year overall survival (OS), defined as the time from enrollment until death from any cause.
Statistical analysis
We analyzed the 18 radiomics dissemination features as listed in supplemental Table 3, and MTV, total lesion glycolysis, TLRSUVmean and TLRSUVpeak, and 5 clinical features, that is, age, Ann Arbor stage, extranodal disease, primary refractory disease vs relapsed disease (R/R status), and B symptoms. Radiomics features were log transformed to obtain a linear relationship with the outcome variable. A clinical model was built using only clinical features, a radiomics model was built for each segmentation method, and the final model was built using both clinical and radiomics features using the segmentation method that showed the best performance. We applied a backward feature selection using the stepAIC function of the R package “MASS” version 7.3-53 to select features for each training model and removed features with high multicollinearity. Backward selection was performed separately for each model and could, therefore, result in the selection of different features per model. Cook distance was calculated but identified no extreme outliers. The models were trained using logistic regression on the BV-DHAP and BV-ICE studies (n = 113, training cohort) and validated on the ICE study (n = 69, validation cohort), using the “glm” function of R package “stats” version 4.0.3. Model performance was assessed by calculating the area under the curve (AUC) of the receiver operating characteristics curve on the training and validation cohorts; the AUC was also cross-validated on the training cohort using fivefold cross-validation with 2000 repeats. The significance of the addition of radiomics features to clinical features was calculated using the deltaAUC test of R package “clinfun” version 1.0.15 for comparing the AUC from receiver operating characteristics curves from nested binary regression models.38 The size of the high-risk group was predefined based on the prevalence of PFS events in the training cohort, which was 26 of 113 (23%).
The high-risk group was identified by selecting the top 23% of patients with the highest prediction scores. Another cutoff based on the Youden Index of the cross-validation on the training cohort was also explored. The Kaplan-Meier method and log-rank test were used to analyze differences in PFS and OS for the high- and low-risk groups. Positive predictive value (PPV), negative predictive value (NPV), sensitivity, and specificity were calculated, and a Cox proportional hazards regression was performed for the high-risk vs low-risk groups. Statistical analysis was performed using R software version 4.0.3. A P value < .05 was considered statistically significant.
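Two steps of this analysis are easy to sketch: the AUC of a set of prediction scores, and the assignment of the highest-scoring fraction of patients to the high-risk group. The study used R; the Python below is only an illustration with hypothetical function names. The AUC is computed as the probability that a patient with an event scores higher than a patient without one, which is equivalent to the area under the ROC curve. How ties at the cutoff were handled is not stated in the text, so the ordering of tied scores here is an assumption.

```python
def roc_auc(scores, events):
    """AUC as P(score of an event case > score of a non-event case); ties count 0.5."""
    pos = [s for s, e in zip(scores, events) if e == 1]
    neg = [s for s, e in zip(scores, events) if e == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def flag_high_risk(scores, fraction):
    """Flag the round(fraction * n) patients with the highest scores as high risk."""
    n_high = round(fraction * len(scores))
    # Indices sorted from highest to lowest score; tied scores keep their input order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    high = set(ranked[:n_high])
    return [i in high for i in range(len(scores))]


# Toy data: 8 patients, 3 with a PFS event (1) and 5 without (0).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1]
toy_events = [1, 1, 0, 0, 0, 0, 1, 0]
toy_auc = roc_auc(toy_scores, toy_events)      # 14 of 15 pairs ranked correctly
toy_flags = flag_high_risk(toy_scores, 0.25)   # top 2 of 8 patients
```

With the prevalence from the training cohort (26 of 113), `flag_high_risk` selects 26 patients, so the high-risk group has the same size as the number of observed events.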
Results
Patient characteristics
In total, 231 patients were treated in the 3 studies, of whom n = 49 were excluded from the analysis. A total of 37 (16%) cases were excluded because the PET was of insufficient quality or not compatible with the analysis software (supplemental Table 2). We included 182 patients in the analysis, of whom n = 113 were included in the training cohort (n = 58 treated with BV-DHAP and n = 55 treated with BV-ICE) and n = 69 in the validation cohort (treated with ICE). Patient characteristics are summarized in Table 1. Most clinical characteristics were well distributed across the training and validation cohorts. However, the training cohort consisted of a higher percentage of patients with B symptoms (P = .011) and patients with stage III disease (P = .004), whereas the validation set had more patients with stage II disease (P = .004).
Characteristic | Training (n = 113), No. | Training, % | Validation (n = 69), No. | Validation, % | Total (n = 182), No. | Total, % | P value
---|---|---|---|---|---|---|---
Study | | | | | | | <.001
BV-DHAP | 58 | 51 | 0 | 0 | 58 | 32 |
BV-ICE | 55 | 49 | 0 | 0 | 55 | 30 |
ICE-GVD | 0 | 0 | 69 | 100 | 69 | 38 |
Female sex | 61 | 54 | 32 | 46 | 93 | 51 | .319
Median age (range) | 30 (13-65) | | 34 (18-66) | | 31 (13-66) | | .175
Primary refractory | 55 | 50 | 25 | 37 | 80 | 45 | .062
Ann Arbor stage | | | | | | | .002
I | 10 | 9 | 1 | 1 | 11 | 6 | .042
II | 46 | 41 | 43 | 62 | 89 | 49 | .004
III | 19 | 17 | 2 | 3 | 21 | 12 | .004
IV | 38 | 34 | 23 | 33 | 61 | 34 | .970
Extranodal disease | 44 | 39 | 25 | 36 | 69 | 38 | .715
B symptoms | 28 | 25 | 7 | 10 | 35 | 20 | .011
The median follow-up time was 42.4 months (range, 25.5-82.6 months) for the training cohort and 72.3 months (range, 25.5-146.5 months) for the validation cohort. In the training and validation cohort, 26 (23%) and 22 (32%) patients had a 3-year PFS event, and 9 (8%) and 15 patients (22%) died, of whom only 2 (2%) and 1 (1%) died without progressive disease, respectively.
Clinical model
The following clinical patient characteristics were used at time of relapse: age, Ann Arbor stage, presence of extranodal disease, B symptoms, and R/R status. Backward feature selection resulted in selection of 3 variables: stage, B symptoms, and R/R status. The clinical model yielded a cross-validated AUC of 0.729 in the training cohort and an AUC of 0.677 in the validation cohort (Table 2; supplemental Table 4).
Model | Features∗ | AUC training cohort (95% CI) | CV-AUC training cohort (95% CI) | AUC validation cohort (95% CI)
---|---|---|---|---
Clinical | Ann Arbor stage, B symptoms, R/R status | 0.787 (0.692-0.883) | 0.729 (0.724-0.734) | 0.677 (0.535-0.819)
Radiomics SUV4.0 | No. of lesions, VolSpread, TLRSUVmean | 0.719 (0.605-0.833) | 0.691 (0.685-0.696) | 0.721 (0.580-0.863)
Final model | R/R status, B symptoms, MTV, Spread, TLRSUVmean | 0.837 (0.744-0.930) | 0.810 (0.805-0.814) | 0.750 (0.627-0.872)
P value of clinical vs final model† | | .00094 | .0049 | <.0001
Spread, the sum of the distances between all lesions; TLRSUVmean, tumor-to-liver ratio of the lesion SUVmean and liver SUVmean; VolSpread, the sum of differences in volume between all lesions.
∗All radiomics variables are log transformed.
†P values represent the added value of the radiomics features to the clinical model. The P value of the cross-validated (CV)-AUC represents the median P value of 2000 repeats of fivefold cross-validation.
Radiomics model with different segmentation methods
For each tumor segmentation method, that is, SUV4.0, SUV2.5, 41max, and combimethod, the backward feature selection was performed on all features as listed in supplemental Table 3. This resulted in 4 prognostic models with different features for each method, of which the SUV4.0 method yielded a cross-validated AUC of 0.691 and the highest validated AUC of 0.721 (Table 2; supplemental Figure 1B-E). The AUC values of the SUV2.5 method were comparable with those of the SUV4.0 method, whereas the 41max method yielded lower AUC values (supplemental Table 4). The model of the combimethod, in which segmentation using a threshold of SUV4.0 was combined with a threshold of SUV2.5 for missing lesions with low uptake, did not result in higher AUC values compared with the separate SUV4.0 or SUV2.5 models (validated AUC of 0.712 vs 0.721 and 0.714, respectively). To rule out differences in model performance because of backward feature selection, the features selected for the SUV4.0 model were also tested on the other methods, but this did not increase the AUCs of these models (supplemental Table 4). Because of high AUC values and a technical validation in an earlier publication, the SUV4.0 method was chosen as the tumor segmentation method for the final model.26
Combined prognostic model
For the final prognostic model, backward feature selection was performed using all radiomics features from segmentations with the SUV4.0 method in combination with clinical features. Backward selection resulted in selection of the following features: R/R status, B symptoms, MTV, sum of all distances between all lesions (Spread), and TLRSUVmean, and yielded a high cross-validated AUC of 0.810 in the training cohort and an AUC of 0.750 in the validation cohort (Tables 2 and 3). The addition of radiomics features (MTV, Spread, and TLRSUVmean) to clinical features showed significant improvement of the AUC in the cross-validated training (P = .0049) and validation (P < .0001) cohorts (Table 2). Ann Arbor stage was not part of the final prognostic model because it was outperformed by the radiomics feature Spread. Replacing Spread with stage resulted in a lower prognostic value for the model (data not shown). Logistic regression results of the model are shown in Table 3.
Features | Estimate | SE | Z value | P value
---|---|---|---|---
Intercept | -2.5 | 2.0 | -1.2 | .219
R/R (relapsed) | -2.5 | 0.7 | -3.8 | <.001
B symptoms | 1.0 | 0.7 | 1.5 | .136
MTV | -0.4 | 0.2 | -1.6 | .118
Spread | 0.4 | 0.2 | 2.7 | .007
TLRSUVmean | 2.4 | 1.0 | 2.4 | .018
Logistic regression results of features in the baseline model. Formula of the model: -2.472 - [2.478 × (relapsed = 1, refractory = 0)] + [1.010 × (B symptoms = 1, no B symptoms = 0)] - [0.384 × log(MTV in μL)] + [0.413 × log(Spread)] + [2.409 × log(SUVmean / liver SUVmean)].
MTV, metabolic tumor volume; R/R, relapsed/refractory; SE, standard error; spread, sum of all distances between all lesions; TLRSUVmean, tumor-to-liver ratio of lesion standard uptake value (SUV) mean and the liver.
Based on the predefined cutoff (23% of PFS events), the high-risk group in the training cohort showed a significantly inferior PFS compared with patients in the low-risk group, with a 3-year PFS of 38.1% (95% confidence interval [CI], 23-62) vs 88.4% (95% CI, 82-95; P < .0001), respectively (Figure 2A; supplemental Table 5). Three-year PFS in the independent validation cohort was 38.5% (95% CI, 19-77) vs 75.0% (95% CI, 65-87; P = .0153) for the high- and low-risk groups, respectively (Figure 2B). The 3-year OS was also significantly different between the high- and low-risk groups in the training and validation cohorts (Figure 2C-D). The PPV and NPV for the prediction of 3-year PFS were 61.5% and 88.5%, respectively, in the training cohort, and 61.5% and 75.0%, respectively, in the validation cohort. The PPV and NPV were similar between the 2 studies in the training cohort, that is, the BV-DHAP and BV-ICE studies (Table 4). Results using another exploratory cutoff based on the Youden Index on the cross-validation of the training cohort did not improve the PPV and NPV (supplemental Table 5).
High- vs low-risk | Training | Validation | BV-DHAP | BV-ICE
---|---|---|---|---
Sensitivity, % | 61.5 | 36.4 | 61.5 | 61.5
Specificity, % | 88.5 | 89.4 | 93.3 | 83.3
PPV, % | 61.5 | 61.5 | 72.7 | 53.3
NPV, % | 88.5 | 75.0 | 89.4 | 87.5
Performance of the model shown for the training and validation cohorts. The training cohort consists of the BV-DHAP and BV-ICE studies of which the model performance is also shown separately. The optimal cutoff for high- vs low-risk groups is based on the percentage of PFS events in the training cohort which was 23%.
BV, brentuximab vedotin; DHAP, dexamethasone, high-dose cytarabine and cisplatin; ICE, ifosfamide, carboplatin, etoposide; NPV, negative predictive value; PPV, positive predictive value.
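The four measures in Table 4 follow from a 2 × 2 table of predicted risk group against observed 3-year PFS event. The sketch below shows the arithmetic. The counts are not reported in the text: they are back-calculated here from the cohort sizes (113 and 69), the numbers of PFS events (26 and 22), and the reported percentages, and should be read as an assumption that happens to reproduce the training and validation columns.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and NPV (in %) from a 2 x 2 table.

    tp: high risk with event      fp: high risk without event
    fn: low risk with event       tn: low risk without event
    """
    return {
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
        "ppv": 100 * tp / (tp + fp),
        "npv": 100 * tn / (tn + fn),
    }


# Back-calculated counts (assumption, not reported in the text).
training = diagnostic_metrics(tp=16, fp=10, fn=10, tn=77)    # 113 patients, 26 events
validation = diagnostic_metrics(tp=8, fp=5, fn=14, tn=42)    # 69 patients, 22 events
```

Under these counts the PPV is identical in both cohorts (16/26 and 8/13), whereas the sensitivity drops in the validation cohort because the high-risk group contains only 13 patients against 22 events.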
In the training cohort, the CMR rate before ASCT was significantly higher in the low-risk group compared with that of the high-risk group (86% vs 69%; P = .049), but this was not the case in the validation cohort. Before ASCT, negative PET result rates seemed higher in the validation cohort because more patients with a positive PET result were excluded during the quality check of PETs. Furthermore, significantly more patients had progressive disease after ASCT in the high-risk groups of both training and validation cohorts (supplemental Table 6).
Correlations of clinical and radiomics features
In Figure 3, several radiomics features that were used in the models are stratified for patients with or without a PFS event. MTV was not significantly higher in patients with an event (P = .12) (Figure 3A). However, MTV still contributed to the prognostic value of the model because of a complex interaction term with Spread and TLRSUVmean, in which Spread showed a higher prognostic value when MTV was high, and in contrast, TLRSUVmean showed a higher prognostic value when MTV was low (Table 3; supplemental Figure 2). A possible explanation for this interaction between Spread, MTV, and TLRSUVmean is that when MTV and Spread are low, a high TLRSUVmean indicates a more aggressive disease and has a worse prognosis, whereas when MTV and Spread are high, the TLRSUVmean is less relevant and the spread of the disease becomes more important to indicate a worse prognosis.
Most radiomics features show moderate to high correlations with other radiomics features. TLRSUVmean, which is included in the model, shows the lowest correlations with other radiomics features (Figure 3I). Ann Arbor stage was significantly correlated with MTV, Spread, Dmax, and the number of lesions but not with TLRSUVmean (Figure 3J-N). Patients with B symptoms had significantly higher values of several radiomics features (Figure 3O-S). R/R status did not correlate with any of the radiomics features (data not shown).
Patients with low and high prediction scores
Examples of patients with low and high prediction scores are provided in Figure 4. The formula for the prognostic model can be found in the description of Table 3. Additionally, we created a calculator in Excel format that can be used to calculate the predicted probability for individual patients (supplemental Appendix). For example, patient B from Figure 4 had relapsed disease with no B symptoms, an MTV of 11 mL, Spread of 411 cm, and TLRSUVmean of 2.43; the model calculated a risk score of 0.05, which places this patient in the low-risk group. Correspondingly, this patient had a CMR before ASCT and is still in remission after 41 months of follow-up. In contrast, patient C had primary refractory disease with B symptoms, an MTV of 24 mL, Spread of 677 cm, and TLRSUVmean of 4.59, corresponding to a risk score of 0.88, and relapsed 3 months after ASCT despite an initial CMR on the pre-ASCT PET.
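The two worked examples can be reproduced from the formula in the description of Table 3. The sketch below is not the authors' Excel calculator, and it rests on three assumptions that the text does not state explicitly: "log" is the natural logarithm, Spread enters the formula in millimeters, and the risk score is the logistic transform of the formula's result. With MTV in μL (as the formula specifies), these choices give 0.05 for patient B and 0.88 for patient C. The function name is hypothetical.

```python
import math


def risk_score(relapsed, b_symptoms, mtv_ml, spread_cm, tlr_suvmean):
    """Predicted probability of a 3-year PFS event from the Table 3 formula."""
    mtv_ul = mtv_ml * 1000.0      # formula states MTV in microliters
    spread_mm = spread_cm * 10.0  # assumption: Spread in millimeters
    linear_predictor = (
        -2.472
        - 2.478 * (1 if relapsed else 0)    # relapsed = 1, refractory = 0
        + 1.010 * (1 if b_symptoms else 0)  # B symptoms = 1, none = 0
        - 0.384 * math.log(mtv_ul)          # assumption: natural logarithm
        + 0.413 * math.log(spread_mm)
        + 2.409 * math.log(tlr_suvmean)
    )
    # Logistic transform of the linear predictor gives a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-linear_predictor))


patient_b = risk_score(relapsed=True, b_symptoms=False, mtv_ml=11, spread_cm=411, tlr_suvmean=2.43)
patient_c = risk_score(relapsed=False, b_symptoms=True, mtv_ml=24, spread_cm=677, tlr_suvmean=4.59)
print(round(patient_b, 2), round(patient_c, 2))  # 0.05 0.88
```

Using MTV in mL or Spread in cm instead shifts the linear predictor by a constant and no longer matches the reported scores, so the units should be checked against the supplemental calculator before any use.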
Discussion
There is an unmet need for better risk stratification in the R/R setting for patients with cHL receiving salvage therapy followed by ASCT.4 Therefore, we have developed a novel prognostic model in R/R cHL for 3-year PFS based on quantitative features from baseline PET scans and clinical characteristics that was validated on an independent data set. The features that were included in our model are robust and not sensitive to technical variations, which makes the model more feasible to implement in clinical practice, where PET scanners of varying quality are used across countries and hospitals.
Several studies have developed prognostic models based on clinical characteristics and pre-ASCT response assessment to predict post-ASCT outcomes, but models for risk profiling at baseline, before starting second-line treatment, are scarce.8,39 A baseline model has the advantage of being able to preselect patients with high-risk disease and change therapy upfront, preventing these patients from not responding to salvage chemotherapy while being at risk of toxicity. The PPV was 61.5% in both the training and validation cohorts, which is similar to or slightly higher than the PPV of pre-ASCT response assessment by PET, as described in literature (PPV ranges between 40% and 60% for individual studies), but with our model, this prediction can already be done at baseline.9,10,30,40 Therefore, it is worthwhile to explore the model’s applicability for changing the salvage therapy in patients with R/R cHL with high-risk disease, such as treating these patients, who are most likely to be chemotherapy resistant, with checkpoint inhibitors.41 In addition, our model showed a high NPV in both the training and validation cohorts, which means that the model is also suitable for selecting patients with a low risk of progression. This could be used to guide the selection of patients who can potentially be cured by replacing ASCT with a less toxic consolidation with checkpoint inhibitors, as is currently being evaluated in several studies.42
A limitation of our analysis is that 37 PETs (16% of total cohort) had to be excluded from the analysis because of insufficient quality of the PETs or because the PET format was not compatible with our analysis software. These scans mainly originated from the ICE study,30 which enrolled patients between 2007 and 2010, when the use of PETs was just emerging in clinical practice. Because the quality of PETs has much improved over the years, it is expected that the percentage of excluded PETs will be much lower in future trials.
Not all patient characteristics were balanced between the cohorts, which could have influenced the performance of the model on the validation cohort. Because the model was trained using the training cohort, the validation cohort showed lower AUC values. However, the cross-validated AUC of the training cohort closely resembled the AUC of the validation cohort, and the PPV was similar between the cohorts. This indicates that in both the training and validation cohorts, patients with high-risk disease were well identified. In addition, the ICE study had a higher number of events, possibly because this study did not treat patients with BV and the study was conducted ∼10 years before the BV-ICE and BV-DHAP studies, so supportive care could have improved over time. In the 3 patient cohorts that we used in this analysis, all patients were intended to receive salvage chemotherapy followed by ASCT, but the salvage chemotherapy schedules were different. Many different salvage regimens are used in R/R cHL across different countries but are generally comparable in terms of efficacy. Therefore, it is also useful to validate the model with different treatment regimens so that it can be extrapolated to other salvage regimens.
We used 3 different semiautomatic segmentation methods, which we have investigated earlier.26 In our previous study, results with the SUV4.0 and SUV2.5 methods were highly correlated, and these methods required the least manual adaptation during segmentation.26 In the current analysis, the SUV4.0 method yielded the highest validated AUC score for the prognostic model and was again the least time-consuming method. The combimethod (SUV4.0 + SUV2.5) that we tested did not improve the prognostic value of radiomics features. A possible explanation for this may be that low FDG-avid lesions (SUV < 4.0) are reactive to the lymphoma and do not substantially contribute to the disease characteristics and, therefore, have no influence on the prognostic capabilities of the radiomics features. In this study, we have confirmed the findings of our previous analysis in a larger cohort of patients and propose to use SUV4.0 as a standard segmentation method for cHL at baseline assessment.26
Other studies investigating quantitative PET features in cHL mainly focused on only MTV and were performed in single cohorts without validation in external cohorts.9,11-16 Besides, most studies used a cutoff for MTV instead of the continuous variable in a logistic regression. In our model, MTV was not the highest contributing factor; therefore, we think that combining MTV with other quantitative PET features, for example, intensity and dissemination features, is important because this enables capturing differences between patients with localized bulky disease and those with disseminated disease.
Other biomarkers, such as circulating tumor DNA and thymus and activation regulated chemokine (TARC), have been shown to correlate with MTV.43 Circulating tumor DNA seems a promising biomarker for detecting minimal residual disease, but its prognostic value at baseline is modest and comparable with that reported in studies that investigated MTV as a single biomarker.44 We previously published an analysis of TARC levels in 65 patients with R/R cHL (who are also included in this analysis), in which we demonstrated that TARC has a high prognostic value after 1 cycle of chemotherapy but no prognostic value at baseline.10 The combination of TARC and pre-ASCT SUVpeak increased the accuracy of predicting progression; therefore, combining biomarkers could enhance the prognostic capacity of biomarker models.
PET-CT is already performed as part of standard clinical practice in most countries.37 We showed earlier that semiautomatic segmentation using the SUV4.0 method requires the least manual adaptation by a nuclear medicine physician and is, thus, less observer dependent. Therefore, quantitative analysis of PET scans can be used in clinical practice at low extra cost and will probably not be very time consuming. With upcoming technological advances, such as automated segmentation, PET radiomics analysis is expected to become much easier to perform in the future.45 Our model consists of robust quantitative PET features, which limits the variability of features between different PET scanners, hospitals, and observers. Quantitative PET analysis therefore provides a promising method for prognostication that can feasibly be implemented in prospective baseline risk-adapted clinical trials.
Acknowledgments
The authors thank the patients and collaborating investigators who kindly supplied their data.
This work was financially supported by Dutch Foundation of hemato-oncological research (http://www.steunhematologie.nl/), which is a nonprofit donation fund of Amsterdam UMC.
The sponsors for this work had no role in gathering, analyzing, or interpreting the data and outcomes described in the manuscript.
Authorship
Contribution: J.D., R.B., and J.M.Z. designed the study; J.D. performed the PET segmentation under supervision of G.J.C.Z. and H.S.; G.J.C.Z. and H.S. reviewed the staging and response assessment of the PETs; J.D. performed the statistical analysis and drafted the manuscript, with contributions from all authors; and all authors collected and interpreted the data, and read, commented on, and approved the final version of the manuscript.
Conflict-of-interest disclosure: R.B. serves as a scientific adviser and chair of the EARL accreditation program. M.J.K. declares consultancy for Bristol Myers Squibb/Celgene, Kite/Gilead, Miltenyi Biotech, Novartis, and Takeda; receives honoraria from Kite/Gilead, Novartis, and Roche; and receives research funding from Kite/Gilead and Takeda. C.H.M. declares research support from Seagen. A.J.M. declares consultancy for Takeda, Imbrium Therapeutics, Janpix, Merck, and Seagen, and receives research funding from Incyte, Merck, Seagen, ADC Therapeutics, BeiGene, Miragen, and Bristol Myers Squibb. J.M.Z. receives research funding from Takeda and Roche, and declares consultancy for Karyopharm. The remaining authors declare no competing financial interests.
Correspondence: Julia Driessen, Department of Hematology, Amsterdam UMC, University of Amsterdam, Meibergdreef 9, 1105 AZ Amsterdam, The Netherlands; e-mail: j.driessen@amsterdamumc.nl.
Author notes
Researchers may request access to certain deidentified data and related study documents by contacting the corresponding author, Julia Driessen (j.driessen@amsterdamumc.nl).
The full-text version of this article contains a data supplement.