Key Points
Using a genome-wide association study, we found the first replicated genetic association with acute chest syndrome in sickle cell disease patients.
The locus identified includes COMMD7, a gene highly expressed in the lung that interacts with NFκB to control inflammatory responses.
Abstract
Patients with sickle cell disease (SCD) present with a wide range of clinical complications. Understanding this clinical heterogeneity offers the prospect of tailoring the right treatments to the right patients and of guiding the development of novel therapies. Several environmental (eg, nutrition) and nonenvironmental (eg, fetal hemoglobin levels, α-thalassemia status) factors are known to modify SCD severity. To find new genetic modifiers of SCD severity, we performed a gene-centric association study in 1514 African American participants from the Cooperative Study of Sickle Cell Disease (CSSCD) for acute chest syndrome (ACS) and painful crisis. From the initial results, we selected 36 single nucleotide polymorphisms (SNPs) and genotyped them for replication in 387 independent patients from the CSSCD, 318 SCD patients recruited at Georgia Health Sciences University, and 449 patients from the Duke SCD cohort. In the combined analysis, an association between ACS and rs6141803 reached array-wide significance (P = 4.1 × 10−7). This SNP is located 8.2 kilobases upstream of COMMD7, a gene highly expressed in the lung that interacts with nuclear factor-κB signaling. Our results provide new leads toward a better understanding of clinical variability in SCD, a “simple” monogenic disease.
Introduction
Sickle cell disease (SCD) is among the most common Mendelian diseases worldwide and is particularly prevalent in regions where malaria is endemic.1,2 SCD is caused by mutations in the β-globin gene that encodes 1 of the subunits of the oxygen carrier hemoglobin and is characterized by a wide spectrum of disease-specific complications. In the deoxygenated state, sickle hemoglobin forms long polymers that alter erythrocyte shape and flexibility, thus increasing hemolysis and the adherence between sickled red blood cells and the endothelium.3,4 Although hemolysis and cell adherence are the main causes of complications in SCD, little is known about the additional environmental and nonenvironmental factors that may modify disease severity and therefore explain the remarkable clinical heterogeneity observed in this otherwise simple monogenic disease.
Environmental variables such as nutrition and sufficient hydration are linked to clinical heterogeneity in SCD.4,5 Two nonenvironmental factors, high fetal hemoglobin (HbF) levels and concomitant α-thalassemia, also correlate with reduced morbidity and mortality in SCD.6 The identification of additional disease-severity modifiers may yield novel insights into SCD pathophysiology. A genetic association study may improve understanding of SCD clinical heterogeneity because it attempts to correlate DNA sequence variants with SCD-specific complications or relevant clinical variables (eg, HbF). Over the last 2 decades, several genetic associations in SCD have been published, but the results are questionable because of small sample size and lack of replication (reviewed in ref. 7). There are, however, 2 notable exceptions: robust associations between (1) 3 loci (BCL11A, HBS1L-MYB, and β-globin) and HbF levels8,9 and (2) the bilirubin levels–associated UGT1A1 locus and gallstones.10,11
To find novel genetic modifiers of SCD, we performed a gene-centric association study in 1514 participants from the Cooperative Study of Sickle Cell Disease (CSSCD) for acute chest syndrome (ACS) and painful crisis. For genotyping, we used the ITMAT-Broad-CARe (IBC) array, which covers genetic variation at ∼2100 genes important for heart, lung, and blood diseases.12 Although the CSSCD is one of the largest existing SCD cohorts, our power to detect variants with small effects on phenotypic variation is modest. For this reason, we selected the 36 variants that reached P < 1.0 × 10−4 in the CSSCD discovery sample and genotyped them for replication in the DNA of 387 independent SCD patients from the CSSCD. We also genotyped markers in 318 SCD patients recruited at Georgia Health Sciences University and 449 patients from Duke University. Our analysis identified 1 single nucleotide polymorphism (SNP; rs6141803) near COMMD7 that reached array-wide significance, defined as P < 2.0 × 10−6 after accounting for the number of independent SNPs present on the IBC array.13 Overall, our findings prioritize DNA sequence variants and genes for future genetic and functional follow-up experiments in order to better grasp patient-to-patient clinical variability in SCD.
Methods
Ethics statement
Informed consent was obtained for all participants in accordance with the Declaration of Helsinki. The Candidate-gene Association Resource (CARe) Study is approved by the ethics committees of the participating studies and of the Massachusetts Institute of Technology. This project was also reviewed and approved by the Montreal Heart Institute Ethics Committee, the Duke Institutional Review Board, and the different recruiting centers.
Samples and genotyping
The CSSCD is described in detail elsewhere.14 Briefly, the CSSCD was a multicenter prospective study of the natural history of SCD; participant enrollment into phase 1 of the CSSCD began in 1978. Participant entry ended in 1981 for all patients older than age 6 months; however, infants continued to be enrolled until 1988. Both mild and hospital-based SCD patients were recruited. A total of 4085 participants, mostly African Americans and ranging in age from newborns to adults, were enrolled in phase 1 from 23 centers across the United States. Data collection for phase 1 of the CSSCD ended in 1988 (see https://biolincc.nhlbi.nih.gov/studies/csscd/ for more information on the study design).
In the CSSCD, painful crisis and ACS events were defined as previously described by the CSSCD investigators and as reported in the CSSCD phase 1 clinical database.15-17 Briefly, a painful crisis episode was defined as an occurrence of pain lasting ≥2 hours in the extremities, back, abdomen, chest, or head that could not be explained by a mechanism other than SCD. Pain episodes within 14 days were treated as a single episode. An episode of ACS occurred when a participant developed a new infiltrate on chest radiograph and/or had a perfusion defect detected on a lung radioisotope scan. Painful crisis and ACS were analyzed as rates by dividing the number of events by the number of patient-years.
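The rate definition above can be illustrated with a minimal Python sketch that counts pain episodes and converts an event count into a rate per patient-year. The function names are hypothetical, and the merging rule used here (a pain event starting 14 days or less after the previous recorded pain event is folded into the same episode) is an assumption made for the example; the CSSCD clinical database may have applied the 14-day window differently.

```python
from datetime import date


def count_episodes(event_dates, window_days=14):
    """Count pain episodes, folding events that fall within `window_days`
    of the previous recorded event into the same episode (assumed rule)."""
    episodes = 0
    previous = None
    for current in sorted(event_dates):
        # A new episode starts only when the gap exceeds the window.
        if previous is None or (current - previous).days > window_days:
            episodes += 1
        previous = current
    return episodes


def event_rate(n_events, patient_years):
    """Events per patient-year of follow-up."""
    if patient_years <= 0:
        raise ValueError("patient_years must be positive")
    return n_events / patient_years


# Hypothetical participant: three recorded pain events over 4 years of follow-up.
dates = [date(1980, 1, 1), date(1980, 1, 10), date(1980, 3, 1)]
rate = event_rate(count_episodes(dates), patient_years=4.0)
```

In this hypothetical example, the first two events are 9 days apart and count as one episode, so the participant contributes 2 episodes over 4 patient-years.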
A total of 318 patients from the Adult Sickle Cell Clinic of Georgia Health Sciences University (GHSU) Sickle Cell Center were included in this study as a validation cohort; patient age ranged from 20 to 74 years and 169 women and 149 men were included in the cohort. In the GHSU cohort, ACS was defined as a new pulmonary infiltrate involving more than 1 lung segment with fever, chest pain, and hypoxia. This excludes patients who present with classic lobar pneumonia, although the distinction may not be 100% reliable. The Duke SCD cohort included 449 adult patients (199 men and 250 women) and used the following questions to define ACS (“Have you ever experienced acute chest syndrome or pneumonia requiring hospitalization?”) and painful crisis (“In the past 12 months, have you had painful episodes requiring hospitalization?”). Demographics for the 3 SCD cohorts used in this study are summarized in Table 1.
DNA genotyping on the Illumina IBC array was carried out at the Broad Institute as part of the National Heart, Lung and Blood Institute (NHLBI) CARe Project. The IBC array interrogates genotypes at ∼50 000 SNPs and captures genetic variation at ∼2100 genes relevant for heart, lung, and blood diseases.12 Data quality control and genotype imputation were performed as previously described.13 Imputation was performed using MACH 1.0.16.18 MACH requires phased reference haplotypes to perform imputation. For the African American CSSCD participants, a combined Northern European (CEU) + Western Africans (YRI) reference panel was created using HapMap phase 2 data.19 This panel includes SNPs segregating in both CEU and YRI, as well as SNPs segregating in 1 panel and monomorphic and nonmissing in the other. Imputation was performed in 2 steps: For the first step, 300 individuals were randomly extracted to generate recombination and error rate estimates. In the second step, these rates were used to impute all individuals across the entire reference panel. Imputation results were filtered at an rsq_hat threshold ≥0.6 and a minor allele frequency (MAF) threshold ≥1%. For imputed markers with strong statistical association, we directly genotyped an overlapping set of CSSCD DNA samples (N = 777) and found high concordance with the imputation results (mean Pearson correlation coefficient = 0.87; range, 0.65-1.0). The final CSSCD discovery dataset included 1514 DNA samples with a genotyping success rate >99.8% (47 092 genotyped SNPs and 190 551 imputed SNPs). Genotyping in the CSSCD and GHSU replication cohorts was performed using the mass spectrometry–based MassArray iPLEX platform from Sequenom, removing SNPs and DNA samples with a genotyping success rate <95% and <90%, respectively. The concordance rate, which was estimated from replicates, was >99.7%. 
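The imputation filters described above (rsq_hat ≥0.6 and MAF ≥1%) can be expressed as a simple predicate. The sketch below is illustrative only: the function name and its arguments are hypothetical and do not correspond to MACH output columns or to the scripts used in this study.

```python
def passes_imputation_qc(rsq_hat, allele_freq, min_rsq=0.6, min_maf=0.01):
    """Return True if an imputed SNP meets both quality thresholds.

    rsq_hat     -- imputation quality estimate for the SNP
    allele_freq -- frequency of either allele (0.0 to 1.0)
    """
    # The minor allele frequency is the smaller of the two allele frequencies.
    maf = min(allele_freq, 1.0 - allele_freq)
    return rsq_hat >= min_rsq and maf >= min_maf


# Hypothetical imputed SNPs as (identifier, rsq_hat, allele frequency).
snps = [("snpA", 0.92, 0.25), ("snpB", 0.55, 0.30), ("snpC", 0.80, 0.995)]
kept = [name for name, rsq, freq in snps if passes_imputation_qc(rsq, freq)]
```

Here only the first hypothetical SNP is retained: the second fails the quality threshold and the third has a minor allele frequency below 1%.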
Genotyping and quality-control filters for the Duke SCD cohort were described elsewhere.20 We used identity-by-descent methods to identify patients who overlap in the CSSCD and Duke cohorts; these patients (N = 23) were excluded from the analysis of the Duke cohort.
Statistical analysis
We performed our discovery experiment in the CSSCD,14,21 a large longitudinal study with hundreds of clinical variables available. Despite this wealth of phenotypic data, we focused initially on only 2 quantitative measures, ACS and painful crisis rates, for 2 reasons. First, statistical power to find genetic associations with quantitative phenotypes was higher than for dichotomous traits (despite remaining relatively modest), even in the large CSSCD. For instance, we estimated that our study design had 50% power to find an association between a quantitative trait and a SNP under the following assumptions: minor allele frequency = 25%, variance explained = 1%, and α = 1 × 10−4 (Table 1). In comparison, we only had 8% power to find an association with stroke (prevalence = 7%) under the same assumptions for a variant with an odds ratio (OR) = 1.5. Second, because our replication cohorts are small, it is likely that, for low-prevalence complications, there would be too few affected SCD patients to robustly validate genetic associations observed in the CSSCD discovery cohort.
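The order of magnitude of the quantitative-trait power estimate can be checked with a standard 1-degree-of-freedom calculation. The sketch below is not necessarily the method used for the estimate reported above, which is not specified in the text; it assumes an additive test whose statistic follows a noncentral χ² distribution with noncentrality N × r²/(1 − r²), where r² is the variance explained, and it assumes N = 1514 (the discovery sample size). Under this model the minor allele frequency does not enter once the variance explained is fixed.

```python
from math import sqrt
from statistics import NormalDist


def quantitative_trait_power(n, variance_explained, alpha):
    """Power of a two-sided 1-df association test for a quantitative trait.

    A 1-df noncentral chi-square variable is the square of a normal variable
    with mean sqrt(ncp), so the power can be written with the normal CDF.
    """
    norm = NormalDist()
    ncp = n * variance_explained / (1.0 - variance_explained)
    shift = sqrt(ncp)
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


# Assumed inputs: discovery sample size, 1% variance explained, alpha = 1e-4.
power = quantitative_trait_power(n=1514, variance_explained=0.01, alpha=1e-4)
```

With these assumed inputs the calculation returns a power close to one half, in line with the figure quoted in the text.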
In the CSSCD discovery cohort, we tested associations between 237 643 genotyped (0, 1, 2) or imputed (0.0-2.0) common SNPs (MAF ≥1%) and phenotypes using Poisson regression (correction for overdispersion) for painful crisis and ACS rates.17 We implemented the analysis using custom scripts in the R 2.10.0 statistical package (www.r-project.org/). We used sex, age at baseline, and the first 10 principal components as covariates. Analyses were stratified based on α-thalassemia status, and association results were combined by inverse variance meta-analyses.22 Analyses of global and local ancestry were performed using, respectively, the EIGENSOFT and HAPMIX software with their default parameters.23-25
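The combination of stratum-specific results by inverse variance meta-analysis can be sketched as follows. This is a generic fixed-effect implementation written for illustration, not the custom R scripts used in the study; the function name and the example effect sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effect inverse variance meta-analysis.

    Each stratum is weighted by 1 / SE^2. Returns the pooled effect,
    its standard error, and a two-sided P value.
    """
    if len(betas) != len(standard_errors) or not betas:
        raise ValueError("need one standard error per effect estimate")
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Use the lower tail to avoid cancellation for large |z|.
    p_value = 2.0 * NormalDist().cdf(-abs(z))
    return pooled_beta, pooled_se, p_value


# Hypothetical per-stratum estimates (eg, 2 alpha-thalassemia strata).
beta, se, p = inverse_variance_meta([0.2, 0.4], [0.1, 0.1])
```

Because both hypothetical strata have the same standard error, the pooled effect is simply their average, and the pooled standard error shrinks by a factor of √2.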
Analyses in the CSSCD replication cohort were performed as for the CSSCD discovery cohort, except that we used α-thalassemia status as a covariate because of the small sample size of the cohort and we did not have access to principal components. In the GHSU SCD cohort, age at baseline and painful crisis information were not available. We analyzed the association between genotypes and ACS (dichotomous) using logistic regression in PLINK26 and sex and year of birth as covariates. For the Duke SCD cohort, logistic regression was used to determine the effect of genotype on a binary definition of ACS using PLINK.26 To examine the impact of number of hospitalizations for painful crisis episodes, logistic regression was employed using SAS version 9.2 (SAS Systems, Cary, NC). In an attempt to reduce any population substructure that may exist, principal component analysis was performed using EIGENSOFT.23 All models were adjusted for sex, age, and the first 2 principal components.
To analyze the effect of heme on the expression of COMMD7 in pulmonary endothelial cells, we accessed the relevant gene expression dataset on the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) website (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi accession number = GSE25014).27 We performed the analyses separately for the pulmonary microvascular endothelial cells and the pulmonary artery endothelial cells using the GEO2R analytical module and default parameters (http://www.ncbi.nlm.nih.gov/geo/info/geo2r.html). We corrected P values using the Benjamini–Hochberg false-discovery rate method.
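For readers unfamiliar with the Benjamini–Hochberg correction, a minimal Python sketch of the adjustment is given below. It is a generic implementation for illustration and is not the GEO2R code; the function name and the example P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    n = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw P values for 4 probes.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each raw P value is multiplied by the number of tests divided by its rank, and the step-down minimum keeps the adjusted values monotone in the raw values.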
Results
Several complications are observed in SCD and are linked to the quality of life and life expectancy of patients with this hemoglobinopathy.15,16,28-31 Our goal in this study was to identify genetic variants associated with 2 of these complications, ACS and painful crisis, in order to better understand clinical heterogeneity in SCD. We modeled and analyzed ACS and painful crisis rates as previously described using Poisson regression.15-17 Although for ACS the distribution of the observed test statistics does not show major departure from the expected null distribution (λGC = 1.032), there was a slight inflation for the painful crisis association results (λGC = 1.069; Figure 1).32 For this reason, we corrected both ACS and painful crisis association results using the genomic control (GC) approach. However, overall the limited inflation that was observed indicates that our analysis was appropriate and accounted for the main possible confounders. To declare statistical significance on the IBC array, we selected a threshold of α = 2 × 10−6 that is sufficient to account for the number of independent tests performed.13 Using this criterion, we identified a single locus that met array-wide significance: an association between rs11817401 in the SORCS1 gene and painful crisis rate (P = 1.2 × 10−7; Figure 1 and Table 2).
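The genomic control correction mentioned above can be sketched in a few lines of Python. This is a generic illustration under the usual assumptions (1-degree-of-freedom test statistics, λGC defined as the median observed χ² divided by the median of the null χ² distribution); the function names are hypothetical, and this is not the code used in the study.

```python
from math import sqrt
from statistics import NormalDist, median

_NORM = NormalDist()
# Median of a 1-df chi-square distribution (about 0.455).
CHI2_1DF_MEDIAN = _NORM.inv_cdf(0.75) ** 2


def p_to_chi2(p):
    """Convert a two-sided P value (0 < p <= 1) to a 1-df chi-square statistic."""
    return _NORM.inv_cdf(p / 2.0) ** 2


def chi2_to_p(chi2):
    """Convert a 1-df chi-square statistic to a two-sided P value."""
    return 2.0 * _NORM.cdf(-sqrt(chi2))


def genomic_control(p_values):
    """Return (lambda_gc, corrected P values).

    Statistics are deflated only when lambda_gc exceeds 1, which is the
    common convention and an assumption of this sketch.
    """
    chi2_stats = [p_to_chi2(p) for p in p_values]
    lambda_gc = median(chi2_stats) / CHI2_1DF_MEDIAN
    deflation = max(lambda_gc, 1.0)
    return lambda_gc, [chi2_to_p(c / deflation) for c in chi2_stats]


# Evenly spaced P values mimic a null distribution, so lambda is close to 1.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
lambda_null, corrected_null = genomic_control(null_p)
```

A λGC near 1, as in this artificial null example, leaves the P values essentially unchanged; larger values shrink every test statistic by the same factor.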
To confirm this association and also identify additional loci that appeared promising but did not reach statistical significance in the CSSCD discovery cohort, we selected all SNPs with discovery P < 1 × 10−4 (before GC correction; 19 SNPs for painful crisis and 17 SNPs for ACS) and genotyped them in 387 independent CSSCD participants. Replication and combined association results are presented in Table 2 and Table 3 for painful crisis and ACS, respectively. The association between SORCS1-rs11817401 and painful crisis did not replicate (replication P = .52, opposite direction of effect). Overall, we replicated in this small CSSCD cohort 2 associations at nominal level (P < .05): an association between FAM193A-rs11732673 and painful crisis (replication P = .02, combined P = 9.9 × 10−6) and an association between rs6141803 and ACS (replication P = .003, combined P = 5.2 × 10−7), the latter reaching array-wide significance when combining the CSSCD discovery and replication results (Table 3). FAM193A encodes a protein with no clear biological functions. The rs6141803 SNP is located between the COMMD7 and DNMT3B genes on chromosome 20. DNMT3B encodes a DNA methyltransferase, which is important for development, and COMMD7, a gene highly expressed in the lung, codes for an adaptor protein that interacts with subunits of the nuclear factor (NF)-κB complex.33 Importantly, treating human pulmonary endothelial cells with free heme, a model that recapitulates some of the cellular responses observed when ACS is induced in a SCD mouse model,34 significantly modulates the expression of COMMD7 (differential expression in pulmonary microvascular endothelial cells [P = 5 × 10−4] and in pulmonary artery endothelial cells [P = 3 × 10−5]).27 This result, together with a role in NF-κB signaling and inflammation, adds additional evidence supporting a role for COMMD7 in ACS.
Our group had access to 2 additional SCD replication cohorts: 318 patients recruited at GHSU and 449 SCD patients from Duke University (Table 1). Painful crisis information was not available for the GHSU cohort and only available as categories for the Duke cohort. ACS information was available for both cohorts but only in the form of a binary presence/absence phenotype. In many situations, dichotomizing a quantitative trait can lead to substantial loss in statistical power as individuals with 1 or several ACS events are all labeled as affected.35 For replication in the GHSU SCD cohort, we genotyped the top 17 SNPs associated with ACS (Table 3). A single variant, rs17728960 in the NFATC2 gene, was nominally significant (P = .05), but the combined association result was not significant. The association between ACS and COMMD7-rs6141803 was not significant (P = .32) but trended in the right direction (OR = 0.41; Table 3). Genome-wide genotype data were available for the Duke SCD cohort. After quality-control steps and genotype imputation (see Methods section), 6 painful crisis and 14 ACS SNPs were available for association testing. The association between ACS and rs6141803 near COMMD7 in the Duke cohort showed a consistent direction of effect (OR = 0.16, P = .08; Table 3). When we combine the results from the CSSCD discovery, CSSCD replication, GHSU, and Duke cohorts at the P value level, using a Z-score method weighted by sample size, the association between ACS and rs6141803 is array-wide significant (weighted P = 4.1 × 10−7).
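A sample size-weighted Z-score combination of P values, as referred to above, can be sketched as follows. The implementation assumes the common form in which each cohort's two-sided P value is converted to a signed Z score and weighted by the square root of its sample size; the exact weighting used in the study is not detailed in the text, and the function name and example inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_NORM = NormalDist()


def weighted_z_meta(p_values, directions, sample_sizes):
    """Combine two-sided P values with a sample size-weighted Z-score method.

    directions   -- +1 or -1 per cohort, the sign of the effect estimate
    sample_sizes -- number of participants per cohort
    Returns the combined Z score and its two-sided P value.
    """
    if not (len(p_values) == len(directions) == len(sample_sizes)):
        raise ValueError("inputs must have the same length")
    numerator = 0.0
    weight_sq_sum = 0.0
    for p, sign, n in zip(p_values, directions, sample_sizes):
        z = -_NORM.inv_cdf(p / 2.0)  # magnitude of the Z score
        weight = sqrt(n)             # assumed weight: sqrt of sample size
        numerator += weight * (z if sign >= 0 else -z)
        weight_sq_sum += weight ** 2
    z_combined = numerator / sqrt(weight_sq_sum)
    return z_combined, 2.0 * _NORM.cdf(-abs(z_combined))


# Two hypothetical cohorts of equal size with concordant effects.
z_same, p_same = weighted_z_meta([0.05, 0.05], [+1, +1], [500, 500])
```

Cohorts with opposite directions of effect partly cancel in the numerator, which is why recording the sign of each effect estimate alongside its P value matters for this method.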
We noted that COMMD7-rs6141803 is an ancestry informative marker: the C allele has a frequency of 17% and 0% in the HapMap individuals of Northern European (CEU) and African (YRI) ancestry, respectively. This observation raises the possibility that the association between ACS and COMMD7-rs6141803 is a false-positive result owing to admixture. However, this is unlikely because the ACS rate is not correlated with the first principal component, which captures European vs African admixture (Spearman ρ = −0.0188, P = .47), and we used the first 10 principal components in our analysis to account for global admixture. Although admixture is therefore unlikely to explain the association between ACS rate and genotypes at rs6141803, we tried to use local ancestry to fine-map the causal variant. We inferred local European vs African ancestry at the locus and used this estimate as a covariate in our regression model.24 The strength of the genetic association between ACS and rs6141803 was reduced when controlling for local ancestry but remained significant (P = 5.4 × 10−5 and P = .001 without and with local ancestry as covariate, respectively), suggesting that rs6141803 is unlikely to be the causal variant at the locus. rs6141803 is intergenic and not in linkage disequilibrium (LD) with any nonsynonymous DNA sequence variants identified by the 1000 Genomes Project.36 Interestingly, however, it is in weak LD (r2∼0.2–0.3) with cis-eQTL SNPs associated with COMMD7 transcript levels in human liver37 and monocytes.38
Discussion
Painful crisis and ACS are, respectively, the first and second most frequent causes of hospital admissions in patients with SCD.39 Although they share common causes (eg, vaso-occlusion), it is also clear that some of the triggering factors are different (eg, the role of infections in ACS). We performed one of the largest genetic association experiments to date in order to identify DNA sequence variants that modify SCD clinical severity through these 2 measures of morbidity.
Our experimental design identified a single SNP, rs6141803, that reached array-wide significance (P = 5.2 × 10−7 in the CSSCD discovery + replication, P = 4.1 × 10−7 if we add the GHSU and Duke SCD samples). Analyzing ACS as a dichotomous phenotype in the GHSU and Duke cohorts, as we did in this study, could account for a loss of statistical power. One additional difference between the CSSCD and the other 2 cohorts is the age of the enrolled patients. The GHSU and Duke cohorts are essentially adult cohorts, whereas the CSSCD includes a large number of children (Table 1). Age is a known predictor of ACS events16 and, for this reason, was included as a covariate in our statistical model. We performed a sensitivity analysis in the CSSCD discovery cohort, which shows that the association signal between ACS and rs6141803 is observed primarily in children: the association was strong in patients recruited before they were aged 5 years (N = 335, P = 4 × 10−6) but not significant in older patients (N = 978, P = .80). Thus, this age effect might explain why we could replicate this association in the CSSCD replication cohort (which also includes children) but not in the adult GHSU and Duke cohorts. It is also possible that this association is a false-positive report. Despite allele frequency differences between the ancestral populations at rs6141803, we performed analyses that suggest that admixture is unlikely to confound this result. Additional replication attempts in large SCD cohorts with quantitative measures of ACS are needed before drawing a final conclusion on the ACS-rs6141803 genetic association.
rs6141803 is an intergenic SNP located between DNMT3B and COMMD7. DNMT3B encodes a DNA methyltransferase that is involved in maintenance DNA methylation. Mutations in DNMT3B cause immunodeficiency-centromeric instability-facial anomalies syndrome-1 (ICF-1; MIM #242860), a very rare syndrome that has not been linked to lung-related complications. COMMD7 encodes a poorly characterized protein that contains a copper metabolism gene MURR1 (COMM) domain. The gene is abundantly expressed in the lung33 and is overexpressed in hepatocellular carcinoma.40 The knockdown of COMMD7 using short-hairpin RNA increases apoptosis and cell cycle arrest, in part, by interfering with NF-κB signaling.33,40 NF-κB is a master regulator of acute inflammation; upon stimulation, it transcriptionally activates interleukins, interferon, tumor necrosis factor-α, and adhesion molecules. The expression of COMMD7 in pulmonary endothelial cells is also affected upon heme treatment27; this is promising as free heme can induce ACS in a SCD mouse model.34 Although more work is needed to clarify the possible role of COMMD7 in ACS, these are potentially interesting observations given the importance of inflammation and free radical production in aggravating ACS episodes.41
A recent large candidate gene study in 942 SCD children identified an association between a microsatellite in the heme oxygenase-1 (HMOX1) promoter and ACS: longer alleles were associated with increased rate of hospitalization for ACS.42 Results from this initial study were not replicated in an independent cohort. Therefore, we queried our own results to test if HMOX1 SNPs were associated with ACS rate in the CSSCD. Although the HMOX1 gene was targeted for genotyping on the IBC array (28 genotyped or imputed nearby SNPs), the microsatellite was not directly tested. Of the HMOX1 SNPs that are accessible on the IBC array, 1 SNP in the 3′ untranslated region of HMOX1 is associated with ACS at nominal significance (rs12160039; P = .02). We would need to directly genotype the promoter microsatellite to determine if this SNP captures the association signal with ACS through LD. Similarly, although an intronic sequence repeat polymorphism in the NOS1 gene has been proposed to influence ACS risk,43,44 none of the NOS1 SNPs tested in our study showed significant association results with ACS.
The second most interesting association identified in our experiment is between painful crisis rate and rs12720497, an intronic SNP in the PLA2G4A gene. The association is not array-wide significant, but the directions of effect are consistent between the discovery and replication CSSCD panels (replication P = .08, combined P = 1.2 × 10−5; Table 2). PLA2G4A encodes a cytosolic phospholipase A2, an enzyme implicated in the production of proinflammatory molecules (prostaglandins and leukotrienes) that has previously been implicated in increased sensitivity to pain (hyperalgesia) in humans.45,46 Enzymes in the phospholipase A2 family can be divided into 4 groups (cytosolic, secreted, calcium-independent, and lipoprotein-associated), and high levels of secreted phospholipase A2 have been suggested to be predictive of future ACS events.47-49 The link between cytosolic and secreted phospholipase A2 and their role in SCD complications is intriguing, especially because severe ACS often occurs in the course of vaso-occlusive painful crisis. In our data, however, rs12720497 in PLA2G4A (coding for cytosolic phospholipase A2) is not associated with ACS rate (P = .67).
We performed our discovery search in 1514 participants from the CSSCD whose DNA was genotyped on gene-centric genotyping arrays.12 With the caveat that this genotyping platform only captures genetic variation at a subset (∼10%) of the predicted human genes, we did not identify loci with moderate to strong effect on phenotype, which is consistent with most reported genome-wide association study results.50 Our results highlight promising variants for further replication in independent SCD cohorts and biologically plausible candidate genes (eg, COMMD7, PLA2G4A) to test functionally, for instance, in SCD mouse models. They are also indicative of the importance of combining genome-wide association study results through meta-analyses between SCD cohorts to gain sufficient statistical power to identify genetic associations of weak phenotypic effect.
The publication costs of this article were defrayed in part by page charge payment. Therefore, and solely to indicate this fact, this article is hereby marked “advertisement” in accordance with 18 USC section 1734.
Acknowledgments
The authors acknowledge the contribution of Mélissa Beaudoin for DNA genotyping and Cameron D. Palmer for genotype imputation and thank all the patients who contributed to this study.
CSSCD is supported in part by the National Institutes of Health, National Heart, Lung, and Blood Institute (N01-HB-47110). CARe is supported by the National Heart, Lung, and Blood Institute (HHSN268200625226C). A full listing of the grants and contracts that have supported CARe is provided at http://www.nhlbi.nih.gov/resources/geneticsgenomics/programs/care.htm. The work in the Lettre Laboratory is supported by an Innovation in Clinical Research Award grant from the Doris Duke Charitable Foundation (2009089), the Canada Research Chair Program, the Canadian Institute of Health Research (123382), and the Fonds de Recherche Santé Québec. The work at Duke University was funded in part by the National Heart, Lung, and Blood Institute (RO1 HL079915 and RC2-HL101212).
Authorship
Contribution: G.G., S.C., G.L., and G.J.P. conceived and designed the experiment; G.G., S.C., M.E.G., N.J., K.S., and G.L. performed experiments; G.G., S.C., M.E.G., N.J., M.P., D.P., K.S., A.G., A.E.A.-K., M.J.T., A.K., G.L., and G.J.P. analyzed the results; and G.G. and G.L. wrote the manuscript with contributions from all authors.
Conflict-of-interest disclosure: The authors declare no competing financial interests.
Correspondence: Guillaume Lettre, Montreal Heart Institute, 5000 Belanger St, Montreal, Quebec, Canada, H1T 1C8; e-mail: guillaume.lettre@umontreal.ca; and George J. Papanicolaou, National Heart, Lung, and Blood Institute, 6701 Rockledge Dr, Bethesda, MD 20892; e-mail: gjp@mail.nih.gov.
References
Author notes
G.L. and G.J.P. codirected the study.