Key Points
PET-CT is the modern standard for staging Hodgkin lymphoma and can replace contrast-enhanced CT in the vast majority of cases.
Agreement between expert and local readers is sufficient for the Deauville criteria to assess response in clinical trials and the community.
Abstract
International guidelines recommend that positron emission tomography-computed tomography (PET-CT) should replace CT in Hodgkin lymphoma (HL). The aims of this study were to compare PET-CT with CT for staging and measure agreement between expert and local readers, using a 5-point scale (Deauville criteria), to adapt treatment in a clinical trial: Response-Adapted Therapy in Advanced Hodgkin Lymphoma (RATHL). Patients were staged using clinical assessment, CT, and bone marrow biopsy (RATHL stage). PET-CT was performed at baseline (PET0) and after 2 chemotherapy cycles (PET2) in a response-adapted design. PET-CT was reported centrally by experts at 5 national core laboratories. Local readers optionally scored PET2 scans. The RATHL and PET-CT stages were compared. Agreement among experts and between expert and local readers was measured. RATHL and PET0 stage were concordant in 938 (80%) patients. PET-CT upstaged 159 (14%) and downstaged 74 (6%) patients. Upstaging by extranodal disease in bone marrow (92), lung (11), or multiple sites (12) on PET-CT accounted for most discrepancies. Follow-up of discrepant findings confirmed the PET characterization of lesions in the vast majority. Five patients were upstaged by marrow biopsy and 7 by contrast-enhanced CT in the bowel and/or liver or spleen. PET2 agreement among experts (140 scans) with a κ (95% confidence interval) of 0.84 (0.76-0.91) was very good and between experts and local readers (300 scans) at 0.77 (0.68-0.86) was good. These results confirm PET-CT as the modern standard for staging HL and that response assessment using Deauville criteria is robust, enabling translation of RATHL results into clinical practice.
Introduction
Positron emission tomography (PET) and PET-computed tomography (CT), using 2-deoxy-2-[18F]fluoro-d-glucose (FDG), have been extensively used for imaging patients with Hodgkin lymphoma (HL).1-5 International guidelines recently recommended that PET-CT be used for routine staging of FDG-avid lymphomas and for response assessment using a 5-point scale (5-PS), the so-called Deauville criteria.6,7
PET-CT was preferred for staging due to improved accuracy compared with CT and as a baseline for subsequent response assessment.6 Contrast-enhanced CT (ceCT) may be required if accurate nodal measurement is needed (eg, in clinical trials, assessment of bowel involvement, compression/thrombosis of central vessels, and for radiation planning).7 Direct comparison between ceCT and PET-CT when the CT component is performed as a low-dose unenhanced scan suggests that higher dose ceCT has no impact on lymphoma management.8,9 Further, changes in FDG uptake are more relevant than changes in nodal size for response assessment.6 Despite this, both ceCT and PET-CT are frequently performed at diagnosis, with added cost and radiation exposure.
PET is reported to alter stage compared with CT,3,10 but most publications are retrospective and used stand-alone PET.11-16 Previous publications compared imaging techniques but did not report the impact of imaging on the more relevant final clinical stage, inclusive of bone marrow biopsy, and some publications included patients with HL and non-Hodgkin lymphoma (NHL).17-21
Response assessment with the 5-PS has been reported by a number of investigators to have good interobserver agreement in a training set of 50 patients,22 in a study of 260 advanced HL patients with international expert readers,23 and in pediatric HL.24 The 5-PS also improves the positive predictive value of PET compared with previous International Harmonization Criteria.25 These studies, however, were all retrospective, and the 5-PS was not used to direct therapy.
The aims of this study were to determine, in a large cohort of patients with advanced HL, within a prospectively acquired clinical trial, the difference in staging when unenhanced PET-CT is used in place of standard assessment with clinical examination, ceCT, and bone marrow biopsy and the agreement among experts and between experts and local readers using the 5-PS to adapt treatment in real time.
Methods
Patients were registered in the Response-Adapted Therapy in Advanced Hodgkin Lymphoma (RATHL) study and gave written informed consent in accordance with the Declaration of Helsinki. The study had Human Ethics Committee approval in all participating countries.
Staging
Patients underwent clinical assessment, ceCT of the neck, thorax, abdomen, and pelvis, and bone marrow biopsy to assess stage. RATHL included patients with stages IIB to IV and stage IIA with adverse features. Patients also underwent a PET-CT scan with low-dose unenhanced CT at staging (PET0). ceCT and PET-CT scans were performed within 28 days of enrollment. PET-CT scans were acquired around 60 minutes after the intravenous injection of 350 to 550 MBq FDG. The ceCT scans at diagnosis were reported by the radiologist at the recruiting center, and clinical assessment, CT findings, and bone marrow biopsy were used to assign the final RATHL stage by the treating clinician on the case report form. In individual cases, ultrasound or magnetic resonance imaging was also used for staging. Inclusion in the study was based on the RATHL stage and not PET-CT. The PET-CT stage was assigned by central readers at core labs in the United Kingdom, Italy, Sweden, Denmark, and Australia without knowledge of RATHL stage, marrow biopsy, or patient outcome. Causes for discrepancy in stage between PET-CT and other modalities were assessed with reference to the imaging reports, bone marrow biopsy results, and by observing the changes that occurred with treatment on PET scans performed during chemotherapy. ceCT scans were not re-reviewed centrally. PET-CT scans were performed at multiple international centers using standardized methods for acquisition and quality control.26
Response assessment
PET-CT was repeated after 2 cycles of doxorubicin, bleomycin, vinblastine, dacarbazine (ABVD) chemotherapy (PET2). All scans were PET-CT, and stand-alone PET scans were not permitted in the trial. PET2 scans were performed 9 to 13 days after day 15 of cycle 2.
PET2 was scored using the 5-PS,6 according to the level of the most intense residual FDG uptake at involved sites on PET0 as follows: (1) no uptake; (2) uptake ≤mediastinum; (3) uptake >mediastinum but ≤liver; (4) uptake moderately higher than liver; and (5) uptake markedly higher than liver and/or new lesions. The term “X” was applied to new areas of uptake unlikely to be related to lymphoma.
Scores 1, 2, and 3 were regarded as negative, and scores 4 and 5 were regarded as positive. Score 5 was regarded as uptake ≥3 times the maximum standardized uptake value in normal liver.
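The scoring rules above can be sketched in code. This is an illustrative sketch only, not the trial's reading procedure: the 5-PS is a visual assessment, and the use of numeric SUVmax comparisons for scores 1 to 4 is an assumption made here for illustration (only the score 5 rule of at least 3 times the liver value is stated numerically in the text). The function and parameter names are hypothetical, and `None` is used as a stand-in for "no uptake".

```python
from typing import Optional


def deauville_score(lesion_suvmax: Optional[float],
                    mediastinum_suvmax: float,
                    liver_suvmax: float,
                    new_lymphoma_lesion: bool = False) -> int:
    """Return a 5-PS score for the most intense residual lesion.

    Assumption: SUVmax values stand in for the visual comparison
    against mediastinum and liver described in the text.
    """
    if new_lymphoma_lesion:
        return 5  # new lesions attributed to lymphoma score 5
    if lesion_suvmax is None:
        return 1  # no residual uptake
    if lesion_suvmax <= mediastinum_suvmax:
        return 2  # uptake <= mediastinum
    if lesion_suvmax <= liver_suvmax:
        return 3  # uptake > mediastinum but <= liver
    if lesion_suvmax >= 3 * liver_suvmax:
        return 5  # markedly higher than liver (>= 3 x liver SUVmax)
    return 4      # moderately higher than liver


def is_pet_positive(score: int) -> bool:
    """Liver threshold used to direct treatment: 1-3 negative, 4-5 positive."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError("5-PS score must be an integer from 1 to 5")
    return score >= 4
```

New uptake judged unlikely to be lymphoma (the "X" category) is deliberately left out of this sketch, since it is not a numeric score.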
Patients with negative PET2 were randomized to continue with 4 cycles of ABVD or have de-escalation of treatment with 4 cycles of AVD (doxorubicin, vinblastine, dacarbazine). Patients with positive PET2 had escalated treatment with bleomycin, etoposide, doxorubicin, cyclophosphamide, vincristine, procarbazine, and prednisone (BEACOPP)-escalated or BEACOPP-14, according to the treating center preference. A third PET-CT (PET3) was performed 9 to 13 days after day 8 of cycle 3 of BEACOPP-escalated or 2 to 6 days after day 8 of cycle 4 of BEACOPP-14 to ensure effectiveness of therapy. It was left to the treating physician’s discretion whether patients were offered salvage therapy if PET3 remained positive. An end-of-treatment PET scan was not mandated. Patients with a negative PET2 or PET3 scan were not recommended to receive consolidation radiotherapy as routine, although local investigators had the discretion to use radiotherapy if they felt it necessary.
PET2 was scored at the core laboratories within 72 hours, and this score was used to direct treatment. To assess the level of agreement, readers from all core laboratories read the same paired PET0 and PET2 scans from (1) a training set of 50 patients,22 (2) the first 10 patients scored at each core laboratory, and (3) a further 10 patients scored at each core laboratory during the trial. Readers at local PET centers were given the option to score scans. Levels of agreement were measured between central (core laboratory) readers and between local and central readers, using nonweighted κ statistics27 (Stata version 12.1; Stata Corp). Agreement was first measured using the liver threshold that directed treatment (scores 1 to 3 negative and scores 4 and 5 positive); with this threshold, uptake in lesions higher than normal liver uptake resulted in treatment escalation. Agreement was also measured with the mediastinal threshold, regarding scores 1 and 2 as negative and scores 3 to 5 as positive, as this threshold has been used as a benchmark for de-escalation in trials involving patients with HL.1,28,29
κ values between 0.81 and 1.00 indicate very good agreement, 0.61 and 0.80 indicate good agreement, 0.41 and 0.60 indicate moderate agreement, and 0.21 and 0.40 indicate fair agreement.30
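For readers unfamiliar with the statistic, a nonweighted κ for two readers can be sketched as follows. This is a minimal illustration of the standard two-rater Cohen κ, not the Stata routine used in the study, and it does not cover agreement among more than 2 readers. The function names are hypothetical, and treating the band edges as continuous cutoffs (for example, any value above 0.80 as very good) is an assumption.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(reader_a: Sequence[Hashable], reader_b: Sequence[Hashable]) -> float:
    """Nonweighted Cohen kappa for two readers rating the same scans."""
    if len(reader_a) != len(reader_b) or len(reader_a) == 0:
        raise ValueError("both readers must rate the same non-empty set of scans")
    n = len(reader_a)
    # Observed proportion of scans on which the readers agree.
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    # Agreement expected by chance from each reader's marginal frequencies.
    counts_a, counts_b = Counter(reader_a), Counter(reader_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


def interpret_kappa(kappa: float) -> str:
    """Map kappa to the agreement bands quoted in the text."""
    if kappa > 0.80:
        return "very good"
    if kappa > 0.60:
        return "good"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    return "below the bands quoted"


# Hypothetical positive/negative calls on 4 scans, for illustration only.
central = [True, True, False, False]
local = [True, False, False, False]
kappa = cohen_kappa(central, local)  # (0.75 - 0.50) / (1 - 0.50) = 0.5
```

In this toy example the readers agree on 3 of 4 scans, but because half of that agreement is expected by chance, κ is 0.5 rather than 0.75.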
Results
Staging
A total of 1214 patients were registered from 2008 to 2012; 1171 baseline PET-CT scans were available for staging assessment, which was performed retrospectively (Figure 1).
There was agreement between the RATHL stage and the PET-CT stage in 938 (80%) patients. A total of 159 patients (14%) were upstaged and 74 patients (6%) were downstaged by PET-CT (Table 1). The main reason for upstaging was detection of extranodal disease (Table 2), most commonly in bone marrow (Figure 2). Upstaging due to nodal involvement also occurred, mostly below the diaphragm (Table 2). Reasons for downstaging included enlarged nodes and/or spleen, which were not FDG-avid and extranodal sites with abnormal morphology but no FDG uptake (Table 2).
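As a quick arithmetic check, the reported proportions can be reproduced from the counts given above. The variable names are illustrative; the counts are those stated in the text.

```python
# Counts reported in the text for RATHL stage versus PET-CT stage.
concordant = 938
upstaged = 159
downstaged = 74

total = concordant + upstaged + downstaged  # should equal the 1171 baseline scans


def percent(count: int, denominator: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * count / denominator)


stage_percentages = {
    "concordant": percent(concordant, total),
    "upstaged": percent(upstaged, total),
    "downstaged": percent(downstaged, total),
}
# Overall proportion of patients whose stage changed on PET-CT.
stage_changed_percent = percent(upstaged + downstaged, total)
```

The three categories sum to the 1171 available baseline scans, and the stage changed in about one fifth of patients.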
The PET2 scans of patients with discrepant staging findings were compared with the PET0 scan (supplemental Table 1, available on the Blood Web site). At PET2, FDG uptake at sites that resulted in upstaging decreased in parallel with other sites of disease during treatment in all cases (supplemental Table 1). Twenty patients had extranodal lesions on CT that were not FDG-avid. Five of these 20 patients had lesions that did not change on treatment (Figure 3) and were considered unlikely to represent lymphoma (1 adrenal adenoma, 3 lung nodule/s, 1 bone lesion). One patient had a 37-mm cavitating lung nodule on CT, which had enlarged from 5 mm on PET-CT 7 days earlier and was probably inflammatory. Six patients had indeterminate lung nodules, and 1 patient had lobar consolidation, all of which resolved; 3 patients had pleural effusions that resolved; in all these cases, the changes may have been reactive, inflammatory, or related to lymphoma. In 4 patients (small bowel lesions in 1, liver lesions in 2, and both bowel and liver lesions in 1), ceCT was considered more likely to indicate the correct stage. Three patients had splenic lesions on ceCT not seen on PET. Bone marrow involvement was missed on PET-CT but identified on biopsy in 5 patients.
There were 21 cases where review of the reports suggested that the same imaging findings had been differently interpreted by the local radiologist and the core laboratory PET-CT reader or the local radiologist and the treating clinician (Table 2).
Reporting of early response assessment
A total of 1123 PET2 scans were assessable in the trial (Figure 1); 223 PET2 scans were performed in the PET center of one of the core laboratories (Figure 1). Local readers scored 300 of the remaining 900 scans (33%). One hundred forty PET2 scans were scored by all readers at the 5 core laboratories.
When the liver threshold was used, there was agreement between core laboratories that a scan was negative (score 1-3) or positive (score 4 or 5) in 122 of 140 scans: κ (95% confidence interval [CI]), 0.84 (0.76-0.91), indicating very good agreement. There was agreement between central and local readers in 276 of 300 scans; κ (95% CI): 0.77 (0.68-0.86; Table 3), indicating good agreement.
When the mediastinal threshold was used, there was agreement between core laboratories that a scan was negative (score 1 or 2) or positive (scores 3-5) in 81 of 140 scans: κ (95% CI), 0.58 (0.50-0.66), indicating moderate agreement. There was agreement between central and local readers in 249 of 300 scans: κ (95% CI), 0.64 (0.55-0.73; Table 4), indicating good agreement.
Identical scores ranging from 1 to 5 were given in 169 of 300 PET2 scans by central and local readers (Table 5). When assigning scans as positive or negative, 15 PET2 scans scored by local readers as positive were deemed negative by central review, whereas 9 PET2 scans scored by local readers as negative were deemed positive by central review using the liver threshold that determined treatment.
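The raw (observed) agreement behind the liver-threshold PET2 results can be worked out directly from the counts in the text, as sketched below. The names are illustrative. Note that κ itself cannot be recomputed from these counts alone, because the split of concordant scans into both-negative and both-positive is not given here.

```python
def observed_agreement_percent(agreed: int, total: int) -> int:
    """Raw percentage agreement, rounded to the nearest whole number."""
    return round(100 * agreed / total)


# Liver threshold, PET2 (counts from the text).
core_lab_agreement = observed_agreement_percent(122, 140)       # among core laboratories
central_local_agreement = observed_agreement_percent(276, 300)  # central vs local readers
identical_score_agreement = observed_agreement_percent(169, 300)  # same score, 1 to 5

# Discordant central/local calls: local positive but central negative,
# plus local negative but central positive.
discordant_calls = 15 + 9
```

The 24 discordant calls match the 300 minus 276 scans on which central and local readers disagreed, and agreement on the binary call (92%) is much higher than agreement on the exact score (56%).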
One hundred sixty-one PET3 scans were performed; 26 were performed at the PET center of one of the core laboratories (Figure 1). Local readers scored 47 of the remaining 135 PET3 scans (35%). Using the liver threshold, there was agreement among central and local readers that a scan was negative or positive in 45 of 47 scans: κ (95% CI), 0.91 (0.78-1.00), indicating very good agreement. For the mediastinal threshold, there was agreement in 39 of 47 scans: κ (95% CI), 0.61 (0.45-0.87), indicating good agreement.
Identical scores ranging from 1 to 5 were given in 24 of 47 PET3 scans by central and local readers using the liver threshold.
Discussion
Our study is the first to compare PET-CT staging with the established standard of clinical assessment, contrast-enhanced CT, and bone marrow biopsy stage in a large cohort of patients with advanced HL in an international trial. PET-CT altered staging in 20% of patients compared with the standard approach, which is at variance with earlier reports that suggested stage change occurred more often in patients with early stage rather than advanced disease.10,20 Upstaging occurred more frequently than downstaging, with extranodal disease accounting for 74% of the upstaged scans, mostly due to an increased sensitivity of PET for detecting bone marrow involvement.
Baseline and response scans were compared, and the findings were correlated with other imaging, where available, to determine the etiology of lesions that accounted for the discrepancy in stage. This supported the notion that PET-CT stage is more accurate than CT in the majority of cases. There were only 4 patients with organ involvement and 3 with probable splenic involvement identified on ceCT that were missed on PET-CT. Bone marrow biopsy identified involvement in 5 cases (0.4%) in which the marrow appeared normal on PET-CT, confirming that routine bone marrow biopsy is no longer required.7,31
We cannot determine whether staging by PET-CT impacted management because patients were registered in the trial based on having at least stage IIA disease with adverse risk factors on other imaging and marrow biopsy. Upstaging by PET-CT would have been unlikely to impact management as patients had already been assessed as requiring full-course chemotherapy. Similarly, downstaging from stage 4 to stage 3 would not have impacted treatment. PET-CT, however, downstaged 56 patients to stage 2 and 1 patient to stage 1 (5% of the study population), and this could potentially influence treatment choices.
Our results are in keeping with earlier reports11,13-16 that PET stages HL patients differently to ceCT in ∼15% to 30% of cases. Early reports had fewer patients, were retrospective, and were published prior to the widespread introduction of PET-CT.11-16 More recently, Hutchings et al3 compared PET with CT staging prospectively in 99 patients with HL (61 with PET-CT), and Rigacci et al10 in 186 patients (56 with PET-CT). Upstaging, especially by identification of extranodal sites, occurred more frequently than downstaging, similar to our findings.3 Lesions were also overlooked on CT and more easily identified on PET.3,10,20
The truth as to whether lesions identified only on PET represent lymphoma is difficult to determine as biopsy of discrepant lesions is rarely performed. Correlative imaging, treatment response, follow-up, biopsy, or a combination usually serves as a reference standard.3,10-14 Young et al,32 in 49 HL patients, verified stage by laparotomy in 11 patients and biopsy of discordant lesions in the remainder. PET stage was correct in 26 upstaged patients, and a single patient downstaged compared with CT. Taken together, these studies demonstrated improved sensitivity and similar specificity using PET compared with CT.
Changes in management were reported in 12 of 186 patients by Rigacci et al.10 Ten patients were upstaged from limited to advanced disease, and 2 patients had an increase in the radiotherapy field using PET. Hutchings et al3 reported that 7 patients upstaged by PET to advanced disease were treated for limited disease, but only 1 patient developed progressive lymphoma. Eighteen patients had disease progression overall, suggesting that understaging by CT probably did not adversely impact on outcome. The same group later reported, however, that the routine use of PET-CT in their clinical practice resulted in stage migration, with a higher risk of progression associated with focal FDG-avid skeletal lesions.33 Bone marrow involvement was the most common reason for discrepancy in stage between PET and CT in our series. Munker et al13 also reported significantly more treatment failures among patients staged as I or II on CT, yet III or IV on PET compared with patients who were stage I or II by both CT and PET. More accurate delineation of disease is thus likely to be of benefit for patient management and for radiation planning.10,15
International guidelines have recommended that PET-CT should be used for staging of FDG-avid lymphomas,7 because it is more accurate than CT and because a staging scan improves the accuracy of response assessment.6 The role of ceCT in staging is still debated, although the international guidelines concluded that ceCT had limited application7 as it rarely altered staging or management.8,9 Our study confirms this conclusion; organ involvement was detected on ceCT but not PET-CT in 4 patients and splenic lesions in 3 patients, leading to stage change. On the other hand, PET-CT detected extranodal disease in 118 patients and splenic FDG-avid foci in 7 patients and stage was changed. Further, the effective dose associated with ceCT is ∼16 mSv, which is similar to the combined dose from PET with lower-dose unenhanced CT. Avoiding ceCT for staging would thus reduce a patient’s radiation dose by 50%. Although ceCT may be required for planning radiotherapy, in our study, <10% of patients required this.7
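The 50% figure follows from simple arithmetic, sketched below. The ceCT dose is the approximate value quoted in the text; setting the PET with low-dose CT dose exactly equal to it is an assumption made here, since the text says only that the two are similar.

```python
CECT_DOSE_MSV = 16.0             # approximate effective dose of ceCT (from the text)
PET_LOW_DOSE_CT_DOSE_MSV = 16.0  # assumption: "similar" taken as equal to ceCT


def dose_reduction_fraction(cect_msv: float, pet_ct_msv: float) -> float:
    """Fraction of the combined staging dose avoided by omitting ceCT."""
    combined = cect_msv + pet_ct_msv
    return cect_msv / combined


reduction = dose_reduction_fraction(CECT_DOSE_MSV, PET_LOW_DOSE_CT_DOSE_MSV)
```

With both examinations at about 16 mSv, the combined staging dose is about 32 mSv, and omitting ceCT removes half of it.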
PET-CT is recommended for response assessment in FDG-avid lymphomas using the 5-PS.34 The 5-PS has good interobserver agreement22,24,25,35-37 and is predictive of patient outcomes,29,35,38-40 especially in advanced HL.37,41 An advantage of the 5-PS is that the threshold used to define complete metabolic response can be altered according to the clinical context or research question.34 In the RATHL design where patients with a positive scan received escalation from ABVD to BEACOPP, investigators preferred to use the liver threshold to avoid the risk of overtreating patients. In some trials designed to explore de-escalation strategies, a lower mediastinal threshold has been used to avoid the risk of undertreating patients.1,28,29 Yahalom42 expressed concern that the good agreement reported among expert readers may not be reproducible in the community setting, in particular, if treatment was adapted on the basis of the result. Our study demonstrated similar levels of agreement among experts and local readers, with good or very good agreement for the liver but lower moderate to good agreement for the mediastinal threshold. In RATHL, the outcome for patients with a negative scan was not influenced by the PET2 score,43 and our observations suggest that the liver is likely to be a more reproducible threshold, although readers may possibly have paid closer attention to the decision to assign a score of 4 rather than 3, knowing this would result in treatment escalation. The better agreement using the liver43 may be explained by the higher uptake in liver than mediastinum, so the contrast between lesions with low-grade uptake and the reference region is easier to appreciate.22,23 In addition, there is more uniform uptake in the liver than the mediastinum where uptake can be heterogeneous with focal uptake in the vessel wall. 
Standardization of patient preparation and scanning is important to ensure homogeneous uptake in the liver.44 The agreement for scoring at PET3 was good, which has not previously been reported, although the numbers assessed were small. This is reassuring as BEACOPP chemotherapy can be associated with diffuse bone marrow uptake, which might have made interpretation more challenging.
The main limitations to our study were that it was not possible for ceCT scans to be re-reviewed alongside PET-CT scans, and we could not measure the impact of PET-CT on management. Bias may have occurred in the scoring of response scans by local readers as it was optional to score scans, although readers were evenly split between academic and nonacademic institutions.
In conclusion, staging of HL patients in this large prospective study confirms that an important proportion will be staged differently using PET-CT compared with clinical assessment, CT, and bone marrow biopsy. When discordance occurs in the imaging stage, PET-CT is usually more accurate than CT, which may impact on management.3,10,14,32 These findings support the move to a modern standard using PET-CT for staging and suggest that, in the vast majority of cases, ceCT is not required.6,7
Good agreement between local and expert readers indicates that the 5-PS is robust for assessing response when standardized PET protocols are used, and it works effectively in the community setting and in clinical trials. The final results of the RATHL trial will determine whether response adaptation using the 5-PS is successful at improving outcomes in advanced HL. In the meantime, our results strengthen the application of the 5-PS as the optimal method for response assessment in HL.
Presented in part at the 12th International Conference on Malignant Lymphomas, Lugano, Switzerland, June 19-22, 2013.
The online version of this article contains a data supplement.
The publication costs of this article were defrayed in part by page charge payment. Therefore, and solely to indicate this fact, this article is hereby marked “advertisement” in accordance with 18 USC section 1734.
Acknowledgments
The authors thank Cancer Research UK (CRUK/07/033), the Associazione Angela Serra for Cancer Research (Modena, Italy), Larvik Kreftforening, Norway, and Cancer Australia’s Priority Driven Collaborative Cancer Research Scheme, who provided funding for the RATHL trial. The authors also thank Cancer Research UK, the National Institute for Health Research in England, and the Departments of Health for Scotland, Wales and Northern Ireland (C19631/A16091; MRPTADR) for funding the UK National Cancer Research Institute PET Core Laboratory. The enthusiastic support of PET reviewers, PET centers, trial investigators, patients, and their families is also gratefully acknowledged.
Authorship
Contribution: S.F.B. and P.W.J. designed the research; all authors performed research; S.F.B., A.A.K., T.H.R., L.C.P., M.F., S.L., J.R., J.T., A. Fossa, L.B., D.M., F.D., D.A.S., P.S., and P.W.J. collected data; S.F.B., A.A.K., A. Franceschetto, M.J.F., H.A., E.B., and K.H. analyzed and interpreted data; A.A.K. performed statistical analysis; and all authors wrote and approved the manuscript.
Conflict-of-interest disclosure: P.J. has received paid consultancies for Takeda and Bristol-Myers Squibb. All other authors declare no competing financial interests.
Correspondence: Sally F. Barrington, PET Imaging Centre, St. Thomas’ Hospital, Westminster Bridge Rd, London SE1 7EH, United Kingdom; e-mail: sally.barrington@kcl.ac.uk.