Key Points
PROM total score, PROM total score change, and NIH 0 to 3 skin change are associated with clinician-reported response in cutaneous sclerosis.
Human activity profile AAS, SF36 vitality change, LSS skin, and LSS skin change are associated with patient-reported response.
Abstract
Cutaneous sclerosis, a highly morbid subtype of chronic graft-versus-host disease (GVHD), demonstrates limited treatment response under current National Institutes of Health (NIH) response measures. We explored novel sclerosis-specific response measures using Chronic GVHD Consortium data. The training cohort included patients with cutaneous sclerosis from a randomized trial of imatinib vs rituximab and a consortium observational study. The validation cohort was drawn from a different consortium observational study. Clinician-reported measures (baseline and baseline to 6-month change) were examined for association with 6-month clinician-reported response. Patient-reported measures (baseline and baseline to 6-month change) were studied for association with 6-month patient-reported response. A total of 347 patients were included (training 183 and validation 164). Although multiple skin and joint measures were associated with clinician-reported response on univariate analysis, only patient range of motion (PROM) total score, PROM total score change, and NIH 0 to 3 skin change were retained in the final multivariate model (area under the receiver operating characteristic curve [AUC], 0.83 training and 0.75 validation). Similarly, many patient-reported measures were associated with patient-reported response on univariate analysis, but the final multivariate model retained only the human activity profile adjusted activity score (AAS), 36-item Short Form Health Survey (SF36) vitality change, Lee symptom scale (LSS) skin, and LSS skin change (AUC, 0.86 training and 0.75 validation). We identified which sclerosis measures have the greatest association with 6-month clinician- and patient-reported treatment responses, a previously unstudied area. However, given the weaker performance observed in the validation cohort, we conclude that further work is needed. Novel response measures may be needed to optimally assess treatment response in cutaneous sclerosis.
Introduction
Chronic graft-versus-host disease (GVHD) is a serious multisystem immune-mediated disorder after allogeneic hematopoietic cell transplantation (HCT). It is a leading source of late post-HCT death, impaired quality of life and function, increased symptom burden, and prolonged duration of immune suppressive therapy.1-6 Cutaneous sclerosis is a relatively common manifestation of chronic GVHD, characterized by fibrotic change in skin, subcutaneous tissue, and joint/fascia.7 Affected patients may suffer nonmutually exclusive combinations of thickened and/or tight skin epidermis, deep tissue thickening, impairments in joint mobility/function, and thus significant disability. Because responses are limited with available immune suppressive therapies, many will require numerous agents and prolonged overall duration of therapy. Highlighting the pressing need for advances in this area, the 2020 National Institutes of Health (NIH) Chronic GVHD Consensus (working group 4) called for research innovation in this and other highly morbid forms of chronic GVHD.8
Clinical trials in chronic GVHD (such as those that have led to 3 current US Food and Drug Administration–approved agents)9-11 have typically enrolled patients with a diverse range of organ-site manifestations, with a relative underrepresentation of cutaneous sclerosis. In contrast, a few trials have specifically enrolled patients with cutaneous sclerosis: 1 randomized phase 2 Chronic GVHD Consortium trial tested imatinib (n = 35) vs rituximab (n = 37) and demonstrated a 26% to 27% significant clinical response rate at 6 months of therapy, with no significant difference in this outcome per treatment arm.12 This primary outcome was defined by improvements in a sclerotic skin assessment tool (Vienna Skin Scale [VSS]) and range of motion in affected joints (patient range of motion [PROM] scale). Another more recent, multicenter, phase 2, single-arm trial (N = 49) tested ruxolitinib in cutaneous sclerosis.13 Using NIH-defined response in skin or joints at 6 months, 49% achieved a partial response, whereas others had stable or progressive disease. Most of the responses were due to improvements in joint range of motion rather than improvements in sclerotic skin.
Low response rates in cutaneous sclerosis may be driven by limited responsiveness of the disease to currently available therapies, the length of time required for reversal of sclerosis, and the limited sensitivity of the NIH response measures, which were designed for chronic GVHD overall. Under this response assessment tool, resolution of deep sclerotic changes or major functional improvements in affected joints may be needed to document clinical benefit.14 Additionally, other research has suggested that PROM–based joint/fascia responses may be overly sensitive to both response and progression and affected by interobserver variability.15 There is general recognition that better response measures are needed in cutaneous sclerosis. Research in an allied disorder (the autoimmune condition systemic sclerosis) produced a composite response index (American College of Rheumatology Composite Response Index in Systemic Sclerosis [ACR-CRISS]) that has now been implemented in relevant clinical trials.16 This CRISS model uses 5 core items, including the modified Rodnan Skin Score that has been used in cutaneous sclerosis previously.
The main objective of our current analysis was to define which existing measures of skin and joint disease activity in cutaneous sclerosis were associated with clinician- and patient-reported benefit at 6 months. Two Chronic GVHD Consortium observational studies and 1 interventional trial were leveraged to address this question, including separate training and independent validation.
Methods
Parent study populations
Population 1: prospective clinical trial testing imatinib vs rituximab
This multicenter phase 2 trial enrolled patients with cutaneous sclerosis and randomly assigned therapy with either imatinib or rituximab.12 The trial was conducted from 2011 to 2014 and treated a total of 72 patients from 11 participating centers. The core objective was to assess and compare the efficacy of each agent in cutaneous sclerosis. The primary end point of significant clinical response at 6 months was defined by VSS improvement (≥2 points) and PROM improvement (≥2/7 scale or ≥1/4 scale). Beyond this, extensive data were collected on baseline features, objective measures of sclerotic involvement and changes (NIH chronic GVHD scores, VSS, PROM, and goniometer measures), patient-reported outcome measures, as well as skin and blood biomarkers.
Population 2: 2192 observational cohort study
This multicenter observational study enrolled patients with chronic GVHD, which could be either incident (enrollment within 3 months of chronic GVHD diagnosis) or prevalent (over 3 months from diagnosis and within 3 years of HCT). Recruitment occurred from 2007 to 2012 at 9 HCT centers in the United States, and a total of 601 were enrolled. Data collection occurred at enrollment, 3 months (incident cases only), and every 6 months through 5 years. Clinician- and patient-reported data, treatment information, samples, and functional assessments were serially obtained with the overall objective of validating proposed NIH Consensus measures.
Population 3: 2710 observational cohort study
This multicenter observational study enrolled chronic GVHD–affected patients starting a new systemic therapy for chronic GVHD.17 Recruitment occurred from 2013 to 2019 at 12 US HCT centers, with 383 total enrolled. Data collection occurred at enrollment, 3, 6, and 18 months and at the time of systemic treatment change. With the overall objective of testing the NIH response criteria, the study gathered comprehensive clinician- and patient-reported data, treatment information, samples, and functional assessments.
Institutional review board approval was granted for the original 3 studies (1 trial and 2 observational cohort studies listed in the manuscript) including long-term follow-up analyses (including this report).
Current study population and analysis plan
The following were included in this analysis (supplemental Figure 1). First, all participants from the imatinib vs rituximab clinical trial were included; the original trial eligibility included a diagnosis of cutaneous sclerosis (within 12 months from sclerosis diagnosis) with either sclerotic skin, morphea, myofascial involvement, or joint contractures with a score of ≥2 in any area on the Vienna skin scale,18 or a score of ≤5 at the shoulder, elbow, or wrist or a score of ≤3 at the ankle on the PROM scale.19 Second, we included patients from the 2192 and 2710 observational studies based on reported sclerotic skin or joint/fascial involvement at cohort entry. From the parent studies, the following variables were considered for the purpose of defining cutaneous sclerosis: NIH 0 to 3 sclerotic skin scores (inclusive of score 2-3 indicating superficial or deep sclerosis), indication of sclerotic features present, fascial involvement, or VSS grade 3 or 4 for any body region at cohort entry. The imatinib vs rituximab clinical trial and the 2192 study were combined to form the training cohort of this analysis, and the 2710 study formed the independent validation cohort.
The analysis was organized to address 2 parallel questions. First, clinician-assessed sclerosis variables (both baseline and baseline to 6 month change values) were considered for association with 6-month (skin-specific) clinician-reported treatment response. The variables considered included the following: PROM scores (separately at shoulder, elbow, wrist, and ankle, as well as total PROM score),19 total body surface area (BSA) involved (separately for movable and nonmovable sclerotic skin changes), NIH 0 to 3 skin scores, Hopkins skin score (normal, thickened with pockets of normal skin, thickened over majority of skin, thickened and unable to move, and hidebound unable to pinch) and fascia scale (normal, tight with normal areas, tight, and tight and unable to move),20 and total BSA involved with superficial or deep sclerosis per the Vienna skin scale (total VSS for all BSA, not per individual anatomical sites).18 Clinician-assessed response at 6 months was originally captured according to an 8-point scale with categories of “completely gone,” “very much better,” “moderately better,” “a little better,” “about the same,” “a little worse,” “moderately worse,” or “a lot worse.”14 These response categories were collapsed into “improved” (completely gone, very much better, moderately better, and a little better) vs “not improved” (about the same, a little worse, moderately worse, and a lot worse).
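The collapsing of the 8-point response scale into a binary outcome can be illustrated with a minimal Python sketch. The function name `collapse_response` and the example list are hypothetical and used only for illustration; the category labels are those listed above, and nothing here is taken from the study's actual analysis code.

```python
# Hypothetical helper (not the study's code): collapse the 8-point
# response scale into the binary "improved" vs "not improved" outcome.
IMPROVED = {
    "completely gone",
    "very much better",
    "moderately better",
    "a little better",
}
NOT_IMPROVED = {
    "about the same",
    "a little worse",
    "moderately worse",
    "a lot worse",
}


def collapse_response(category: str) -> int:
    """Return 1 for an 'improved' category and 0 for a 'not improved' one."""
    label = category.strip().lower()
    if label in IMPROVED:
        return 1
    if label in NOT_IMPROVED:
        return 0
    # Unknown labels are rejected rather than silently assigned a class.
    raise ValueError(f"Unrecognized response category: {category!r}")


# Illustrative (made-up) 6-month responses for three patients.
example_responses = ["Moderately better", "about the same", "a little worse"]
binary_outcome = [collapse_response(r) for r in example_responses]
```

Rejecting unrecognized labels is a design choice of this sketch: it keeps missing or misspelled categories from being counted as nonresponse.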
Second, patient-reported outcome (PRO) measures (both baseline and baseline to 6-month change values) were considered for association with 6-month (skin-specific) patient-reported treatment response. Patient-reported outcome measures considered included the following: modified scleroderma health assessment questionnaire standard and alternative disability index,21 modified human activity profile (HAP) adjusted activity score,22 SF-36 domain (physical functioning, role physical, bodily pain, general health, vitality, social functioning, role emotional, and mental health) and summary (mental and physical component scores) scores,23 Lee symptom scale (LSS) domain (skin, energy, lung, eye, nutrition, psychological, and mouth), individual question (joint and muscle aches and limited joint movement) and summary (overall summary) scores,24 and Functional Assessment of Cancer Therapy–Bone Marrow Transplant (FACT-BMT) domain (physical, social/family, emotional, and functional well-being; BMT subscale) and summary (Functional Assessment of Cancer Therapy - General [FACT-G], FACT trial outcome index, and FACT-BMT total) scores.25 Patient-response categories were as described above (per clinician-reported response) and similarly collapsed into improved vs not improved summary response categories. These measures are briefly summarized in supplemental Table 3.
Statistical methods
Patient characteristics were summarized and compared across the training and validation cohorts using the χ2 test and Fisher exact test for categorical variables and the Wilcoxon rank sum test for continuous variables. Univariate logistic regression analyses were conducted to examine the association between individual variables (clinician- and patient-reported variables separately) and the 6-month response outcome (clinician- and patient-reported 6-month response, respectively). Both baseline and baseline to 6-month change values were considered for each variable. Multivariate analyses were performed (for the 2 response outcomes separately) using stepwise regression with criteria for entry and stay in the model of P value <.05. Because combined models (incorporating both clinician- and patient-reported variables to model each of our response outcomes) did not improve model performance, we examined only clinical variables in the clinician response model and patient-reported variables in the patient-reported response model. The final performance of each model was reported as an area under the receiver operating characteristic (ROC) curve (AUC). Cut-points were determined using 2 criteria: maximizing the Youden J statistic (J = sensitivity + specificity – 1) and minimizing the distance from the ROC curve to the point of perfect classification (the point [0, 1] when sensitivity is plotted against 1 – specificity, which is the point [1, 1] in specificity and sensitivity coordinates). These cut-points were used to calculate sensitivity, specificity, and positive and negative predictive values.
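The 2 cut-point criteria named above can be sketched in plain Python. This is an illustration under stated assumptions, not the study's implementation: the function names, the convention that a score at or above the threshold counts as a predicted responder, and the toy scores and labels are all hypothetical.

```python
import math


def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for each distinct score.

    Assumption: a patient is predicted to respond when score >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for threshold in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        true_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
        points.append((threshold, true_pos / n_pos, true_neg / n_neg))
    return points


def youden_cutpoint(points):
    """Pick the threshold maximizing J = sensitivity + specificity - 1."""
    return max(points, key=lambda p: p[1] + p[2] - 1)


def closest_corner_cutpoint(points):
    """Pick the threshold minimizing the distance to perfect classification,
    ie, (1 - specificity, sensitivity) = (0, 1)."""
    return min(points, key=lambda p: math.hypot(1 - p[2], 1 - p[1]))


# Toy predicted probabilities and observed responses (made up, not study data).
toy_scores = [0.10, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
toy_labels = [0, 0, 1, 0, 1, 1, 1]

toy_points = roc_points(toy_scores, toy_labels)
best_youden = youden_cutpoint(toy_points)
best_corner = closest_corner_cutpoint(toy_points)
```

On this toy data, both criteria select the same threshold. They can disagree on real data, which is why cut-points from both criteria may be reported side by side.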
Results
Considering eligible participants from the 3 parent studies, a total of 347 patients were included in this analysis (flow diagram, supplemental Figure 1). These were divided into a training set (n = 183) and validation set (n = 164). Baseline characteristics of the included patients are listed in supplemental Table 1. A full description of clinician-reported sclerosis variables and PRO measures is presented in Tables 1 and 2, whereas those that were retained in the final multivariate analysis are presented in Table 3. Actual systemic immune suppressive therapies given for cutaneous sclerosis are detailed in supplemental Table 2.
Table 1. Clinician-reported sclerosis variables

Characteristic | Total, mean (range) | Training, mean (range) | Validation, mean (range) |
---|---|---|---|
ROM-shoulder | 6.3 (2.0-7.0) | 6.3 (3.0-7.0) | 6.2 (2.0-7.0) |
n | 295 | 138 | 157 |
ROM-elbow | 6.3 (1.0-7.0) | 6.3 (1.0-7.0) | 6.4 (3.0-7.0) |
n | 294 | 139 | 155 |
ROM-wrist | 5.7 (1.0-7.0) | 5.7 (1.0-7.0) | 5.7 (1.0-7.0) |
n | 295 | 138 | 157 |
ROM-ankle | 3.4 (1.0-4.0) | 3.3 (1.0-4.0) | 3.5 (1.0-4.0) |
n | 289 | 137 | 152 |
BSA movable | 11.7 (0.0-77.8) | 11.7 (0.0-77.8) | |
n | 182 | 182 | |
BSA nonmovable | 8.0 (0.0-70.0) | 8.0 (0.0-70.0) | |
n | 182 | 182 | |
Hopkins skin score | 1.8 (0.0-4.0) | 1.8 (0.0-4.0) | |
n | 182 | 182 | |
Fascia/joints | 1.1 (0.0-3.0) | 1.2 (0.0-3.0) | 1.0 (0.0-3.0) |
n | 346 | 182 | 164 |
TSS | |||
% grade 0 | 68.5 (0.0-100.0) | 68.5 (0.0-100.0) | |
% grade 1 | 10.3 (0.0-90.1) | 10.3 (0.0-90.1) | |
% grade 2 | 5.9 (0.0-63.1) | 5.9 (0.0-63.1) | |
% grade 3 | 10.1 (0.0-65.0) | 10.1 (0.0-65.0) | |
% grade 4 | 5.2 (0.0-54.0) | 5.2 (0.0-54.0) | |
Total | 183 | 183 | |
VSS | 6.3 (0.0-26.5) | 6.3 (0.0-26.5) | |
n | 183 | 183 |
ROM, range of motion; TSS, total skin score.
Table 2. Patient-reported outcome measures

Characteristic | Total, mean (range) | Training, mean (range) | Validation, mean (range) |
---|---|---|---|
SF36 physical component summary score | 37.5 (11.0-60.7) | 37.5 (13.7-60.7) | 37.4 (11.0-57.9) |
n | 293 | 156 | 137 |
SF36 mental component summary score | 48.5 (7.1-68.4) | 47.2 (17.7-68.4) | 50.0 (7.1-68.2) |
n | 293 | 156 | 137 |
SF36 physical function score | 39.8 (14.9-57.0) | 40.0 (14.9-57.0) | 39.7 (14.9-57.0) |
n | 300 | 162 | 138 |
SF36 role-physical score | 37.1 (17.7-56.9) | 37.4 (17.7-56.9) | 36.6 (17.7-56.9) |
n | 298 | 161 | 137 |
SF36 bodily pain score | 43.0 (19.9-62.1) | 41.5 (19.9-62.1) | 44.7 (19.9-62.1) |
n | 301 | 163 | 138 |
SF36 general health score | 39.3 (16.2-63.9) | 39.1 (16.2-63.9) | 39.6 (16.2-62.5) |
n | 298 | 160 | 138 |
SF36 social functioning score | 41.2 (13.2-56.8) | 40.4 (13.2-56.8) | 42.1 (13.2-56.8) |
n | 301 | 163 | 138 |
SF36 role-emotional score | 44.7 (9.2-55.9) | 43.9 (9.2-55.9) | 45.2 (9.2-55.9) |
n | 298 | 160 | 138 |
SF36 mental health score | 48.9 (10.6-64.1) | 47.5 (16.2-64.1) | 50.5 (10.6-64.1) |
n | 301 | 163 | 138 |
LSS energy score | 41.5 (0.0-96.4) | 43.4 (0.0-96.4) | 39.2 (0.0-92.9) |
n | 304 | 166 | 138 |
LSS lung score | 7.3 (0.0-60.0) | 8.4 (0.0-55.0) | 6.0 (0.0-60.0) |
n | 303 | 165 | 138 |
LSS eye score | 39.8 (0.0-100.0) | 37.2 (0.0-100.0) | 42.8 (0.0-100.0) |
n | 302 | 164 | 138 |
LSS nutrition score | 7.5 (0.0-60.0) | 7.6 (0.0-60.0) | 7.4 (0.0-45.0) |
n | 304 | 166 | 138 |
LSS psychological score | 27.7 (0.0-100.0) | 30.9 (0.0-100.0) | 23.7 (0.0-100.0) |
n | 303 | 166 | 137 |
LSS mouth score | 19.0 (0.0-100.0) | 16.9 (0.0-100.0) | 21.5 (0.0-100.0) |
n | 303 | 165 | 138 |
LSS summary score | 25.3 (4.1-73.3) | 25.9 (4.1-73.3) | 24.5 (4.4-65.9) |
n | 303 | 165 | 138 |
FACT physical well-being | 20.1 (0.0-28.0) | 19.4 (0.0-28.0) | 21.0 (3.0-28.0) |
n | 299 | 161 | 138 |
FACT social/family well-being | 22.0 (2.0-28.0) | 21.7 (2.0-28.0) | 22.3 (7.0-28.0) |
n | 299 | 162 | 137 |
FACT emotional well-being | 18.4 (2.4-24.0) | 17.9 (4.0-24.0) | 18.9 (2.4-24.0) |
n | 299 | 161 | 138 |
FACT functional well-being | 16.3 (1.2-28.0) | 15.7 (1.2-28.0) | 17.0 (2.0-28.0) |
n | 300 | 162 | 138 |
FACT-G | 76.9 (23.0-108.0) | 74.8 (23.0-108.0) | 79.3 (36.0-107.0) |
n | 297 | 160 | 137 |
FACT-BMT subscale | 26.4 (11.0-38.0) | 26.4 (11.0-38.0) | |
n | 162 | 162 | |
FACT trial outcome index | 61.4 (20.0-94.0) | 61.4 (20.0-94.0) | |
n | 161 | 161 | |
FACT-BMT total | 101.3 (36.0-146.0) | 101.3 (36.0-146.0) | |
n | 160 | 160 | |
Joint/muscle aches | 1.9 (0.0-4.0) | 2.0 (0.0-4.0) | 1.7 (0.0-4.0) |
n | 303 | 165 | 138 |
Limited joint movement | 1.8 (0.0-4.0) | 1.8 (0.0-4.0) | 1.7 (0.0-4.0) |
n | 302 | 166 | 136 |
Table 3. Variables retained in the final multivariate analysis

Characteristic | Total (N = 347) | Training (n = 183) | Validation (n = 164) | P value∗ |
---|---|---|---|---|
Clinician-reported measures | ||||
PROM at baseline | 21.7 (12.0-25.0) | 21.6 (13.0-25.0) | 21.9 (12.0-25.0) | .44 |
n | 285 | 136 | 149 | |
PROM change | 0.3 (−12.0 to 8.0) | 0.2 (−12.0 to 8.0) | 0.3 (−11.0 to 7.0) | .79 |
n | 265 | 129 | 136 | |
NIH skin score change | −0.2 (−3.0 to 3.0) | −0.3 (−3.0 to 2.0) | −0.2 (−3.0 to 3.0) | .56 |
n | 340 | 180 | 160 | |
Patient-reported measures | ||||
Modified HAP at baseline | 64.5 (9.0-94.0) | 64.9 (9.0-94.0) | 64.0 (24.0-94.0) | .63 |
n | 299 | 161 | 138 | |
SF36 vitality change | 1.1 (−21.9 to 31.2) | 1.5 (−21.9 to 31.2) | 0.7 (−21.9 to 25.0) | .48 |
n | 265 | 140 | 125 | |
LSS skin score at baseline | 34.1 (0.0-100.0) | 36.7 (0.0-100.0) | 30.9 (0.0-100.0) | .03 |
n | 303 | 165 | 138 | |
LSS skin score change | −9.1 (−75.0 to 60.0) | −9.4 (−75.0 to 35.0) | −8.6 (−70.0 to 60.0) | .76 |
n | 263 | 141 | 122 |
∗Based on the t test.
By 6 months, clinician-assessed improvement was recorded in 52% of training set patients and 53% of validation set patients (percentages hereafter are presented as training set/validation set), comprising completely gone (7%/7%), very much better (12%/10%), moderately better (17%/18%), and a little better (17%/18%). The nonimprovement categories included about the same (23%/23%), a little worse (14%/12%), moderately worse (10%/12%), and very much worse (1%/0%). On univariate analysis, many individual measures were associated with clinician-reported treatment response (supplemental Table 4). On final multivariate analysis, baseline PROM summary score, PROM change from baseline to 6 months, and NIH 0 to 3 skin score change from baseline to 6 months were significantly associated with clinician-reported treatment response (Table 4).
Table 4. Final multivariate models for clinician- and patient-reported response

Variable | OR (95% CI) | P value |
---|---|---|
Clinician model | ||
PROM at baseline | 1.3 (1.1-1.6) | .002 |
PROM change∗ | 1.5 (1.2-2.0) | .002 |
NIH skin score change∗ | 0.3 (0.2-0.6) | <.001 |
Patient model | ||
Modified HAP (per 10) | 1.5 (1.1-2.1) | .02 |
SF36 vitality scale change∗ (per 10) | 3.9 (2.0-7.6) | <.001 |
LSS skin scale (per 10) | 0.6 (0.5-0.8) | <.001 |
LSS skin scale change∗ (per 10) | 0.4 (0.2-0.6) | <.001 |
CI, confidence interval; OR, odds ratio.
∗Change from baseline to 6 months.
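Several odds ratios in the patient model are reported per 10-point difference in the underlying scale. Because logistic regression is linear on the log-odds scale, an OR per 10 points is the per-point OR raised to the 10th power. The short Python sketch below shows the conversion; the function name `rescale_odds_ratio` is hypothetical, and the example simply reuses the reported point estimate for baseline LSS skin (0.6 per 10 points).

```python
import math


def rescale_odds_ratio(odds_ratio: float, from_units: float, to_units: float) -> float:
    """Convert an OR expressed per `from_units` into an OR per `to_units`.

    Relies only on linearity of the log-odds in the predictor:
    log(OR per k units) = k * (log-odds coefficient per unit).
    """
    beta_per_unit = math.log(odds_ratio) / from_units
    return math.exp(beta_per_unit * to_units)


# Reported OR for baseline LSS skin: 0.6 per 10 points.
or_per_point = rescale_odds_ratio(0.6, from_units=10, to_units=1)

# Converting back recovers the reported value (up to floating-point error).
or_per_ten = rescale_odds_ratio(or_per_point, from_units=1, to_units=10)
```

The per-point value is close to 1 (about 0.95), which is why scaling to 10 points makes the estimate easier to read. The same rescaling applies to the confidence limits.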
By 6 months, patient-reported improvement was recorded in 59% of training set patients and 64% of validation set patients (percentages hereafter are presented as training set/validation set), comprising completely gone (6%/8%), very much better (17%/18%), moderately better (18%/20%), and a little better (18%/18%). For those without improvement, categories were about the same (21%/20%), a little worse (11%/9%), moderately worse (7%/6%), and very much worse (2%/1%). On univariate analysis, multiple individual PRO measures (inclusive of domain and summary scores) were associated with patient-reported response at 6 months (supplemental Table 5). Final multivariate analysis confirmed that baseline HAP adjusted activity score, SF-36 vitality score change from baseline to 6 months, and LSS skin (both baseline and change value from baseline to 6 months) were significantly associated with patient-reported treatment response (Table 4). This model considered only subscales for SF-36, Lee symptom scale, and FACT (it did not incorporate both domain scores and summary total scores for each PRO measure). Alternative models that incorporated domain and summary scores did not improve the AUC (data not shown) and were not pursued further.
We used ROC plots for the training and validation cohorts to characterize the performance of the final models for the clinician- and patient-reported responses, respectively. The ROC plots for the training and validation cohorts are presented in supplemental Figures 2 and 3, and final model thresholds and performance (sensitivity, specificity, positive predictive values, and negative predictive values) are presented in Table 5.
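The AUC summarizing each ROC plot can be read as the probability that a randomly chosen responder receives a higher model score than a randomly chosen nonresponder, with ties counted as one-half. A minimal Python sketch of that pairwise definition follows. The function name `auc_from_scores` and the toy data are hypothetical, and this is not necessarily how the study's statistical software computes the AUC.

```python
def auc_from_scores(scores, labels):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    `labels` holds 1 for responders and 0 for nonresponders.
    """
    responders = [s for s, y in zip(scores, labels) if y == 1]
    nonresponders = [s for s, y in zip(scores, labels) if y == 0]
    if not responders or not nonresponders:
        raise ValueError("Both outcome classes are required to compute an AUC.")

    concordant = 0.0
    for r in responders:
        for n in nonresponders:
            if r > n:
                concordant += 1.0   # responder ranked above nonresponder
            elif r == n:
                concordant += 0.5   # ties count as half
    return concordant / (len(responders) * len(nonresponders))


# Toy predicted probabilities and observed responses (made up, not study data).
toy_scores = [0.10, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
toy_labels = [0, 0, 1, 0, 1, 1, 1]
toy_auc = auc_from_scores(toy_scores, toy_labels)
```

In this toy example, 11 of the 12 responder/nonresponder pairs are ranked correctly, so the AUC is 11/12. The double loop is quadratic in sample size, which is acceptable for cohorts of a few hundred patients.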
Table 5. Final model thresholds and performance

Model and cohort | AUC | Cut-point∗ | Sensitivity | Specificity | PPV | NPV |
---|---|---|---|---|---|---|
Clinician model | ||||||
Training | 0.83 | 0.47 | 0.77 | 0.79 | 0.75 | 0.80 |
Validation | 0.75 | 0.47 | 0.60 | 0.79 | 0.77 | 0.63 |
Patient model | ||||||
Cut-point based on Youden J | ||||||
Training | 0.86 | 0.63 | 0.75 | 0.86 | 0.86 | 0.74 |
Validation | 0.75 | 0.63 | 0.52 | 0.84 | 0.85 | 0.50 |
Cut-point based on distance to (1, 1) | ||||||
Training | 0.86 | 0.54 | 0.78 | 0.82 | 0.84 | 0.76 |
Validation | 0.75 | 0.54 | 0.60 | 0.78 | 0.83 | 0.53 |
AUC, area under curve; NPV, negative predictive value; PPV, positive predictive value.
∗Based on Youden J and distance to (1, 1).
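Unlike sensitivity and specificity, the predictive values in Table 5 depend on how common response is in the cohort. The Python sketch below applies Bayes' rule to show that relationship. The function name `predictive_values` is hypothetical, and the 53% response prevalence is an assumption: it is the overall clinician-reported improvement rate in the validation cohort and may differ from the rate in the subset with complete model data. The output therefore only approximates the tabulated values.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) implied by sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Clinician model, validation cohort: sensitivity 0.60, specificity 0.79.
# Prevalence of 0.53 is an assumption (see the paragraph above).
ppv, npv = predictive_values(sensitivity=0.60, specificity=0.79, prevalence=0.53)
```

Under these assumptions, the sketch yields a PPV of about 0.76 and an NPV of about 0.64, close to the reported 0.77 and 0.63.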
Discussion
Major innovation is needed in the treatment and response assessment of cutaneous sclerosis to advance clinical care, conduct of clinical trials, and ultimately improve patient outcomes. Among existing limitations, NIH response criteria applied to this chronic GVHD subgroup failed to capture certain improvements that are recognized by clinicians and patients. Additionally, numerous clinician- and patient-reported measures of sclerosis burden and associated symptoms and impairments have been routinely captured in prior studies, yet the optimal measure or combination of measures to robustly assess treatment response is not known. To address these gap areas, we leveraged 2 major national Chronic GVHD Consortium observational studies and a prior cutaneous sclerosis–specific national clinical trial to test which sclerosis variables were associated with clinician- and patient-reported response, both of which have been previously demonstrated to have association with long-term treatment success.26,27
In the training set, we found strong association between routinely captured skin and joint/fascia measures and 6-month clinician-assessed treatment response. The PROM baseline score and change value, as well as the change in NIH 0 to 3 skin scores, were retained in the final model. This supports the feasibility of testing this model further in other existing observational data sets or even other recent large clinical trials, given that these measures are routinely captured in baseline and response provider surveys in these settings. In contrast, some measures no longer routinely used (eg, Hopkins scale and Vienna skin scale) were not retained in the final model. The patient-response analysis clarifies which (among many possible) PRO measures have the strongest association with patient-reported response and suggests that a combination of the HAP, SF36, and LSS would need to be used for this purpose. These data suggest that both a quality of life (QOL) and a symptom-based PRO are needed to adequately capture sclerosis response.
However, although both models had AUC values generally considered to reflect excellent discrimination in the training set, AUC values in the validation set could only be considered acceptable at best, with significant risks for response misclassification. Accordingly, the models in their current state require further refinement and validation and cannot yet be applied to clinical trials or routine practice. The inferior results in the validation set may be due to several known or unknown factors. One of the largest issues is likely the inherent diversity within and between the populations we examined in this analysis. The included patients uniformly had cutaneous sclerosis, yet had diversity in type, anatomical site, and severity of sclerotic features, varied functional impairments, and varied duration of prior sclerosis before enrollment in the parent studies included here. Another major potential contributor is interobserver variability (both in terms of clinicians and patients) in rating treatment response, as well as differential weighting of improvements in reporting overall response. Additionally, there was marked heterogeneity in the therapeutic agents used across the 3 included studies (with potential variation in treatment efficacy), an inherent challenge given the diversity in patients enrolled in these studies as well as the range of available therapeutic agents.
In total, our results demonstrate that further research is needed. One avenue for additional progress would be a similar exercise in training/validation of a response model using larger and potentially more uniform patient populations; however, it is not possible in the near term based on limited availability of such resources. We also note, for example, the completion of several large chronic GVHD trials (eg, those testing ruxolitinib, belumosudil, or axatilimab)10,11,28 in the recent past, and these study populations could in future work be examined using the methods we have used here in our study population. Separately, novel measures (eg, skin thickness measures,29,30 novel imaging modalities,31-33 tissue and/or blood biomarkers) may ultimately provide new insight and a path forward to optimal response assessment in cutaneous sclerosis, potentially including a composite model incorporating both clinical measures and novel tools. As well, other future directions include development of sclerosis-specific tools, including a sclerosis–specific PRO measure.
Acknowledgments
The authors acknowledge grant funding support CA163438 and CA118953.
Authorship
Contribution: J.A.P., L.O., and S.J.L. designed the study, conducted the analysis, and wrote the manuscript; and E.B., P.A.C., C.C., S.A., C.L.K., and G.L.C. provided significant input on the study analysis and writing of the manuscript.
Conflict-of-interest disclosure: The authors declare no competing financial interests.
Correspondence: Joseph A. Pidala, Blood and Marrow Transplantation and Cellular Immunotherapy, H. Lee Moffitt Cancer Center and Research Institute, 12902 Magnolia Drive, Tampa, FL 33612; email: joseph.pidala@moffitt.org.
Author notes
For potential data sharing, please inquire to the Chronic Graft-versus-Host Disease Consortium, Stephanie J Lee (sjlee@fredhutch.org).
The full-text version of this article contains a data supplement.