We conducted a large-scale association study to identify low-penetrance susceptibility alleles for chronic lymphocytic leukemia (CLL), analyzing 992 patients and 2707 healthy controls. To increase the likelihood of identifying disease-causing alleles we genotyped 1467 coding nonsynonymous single nucleotide polymorphisms (nsSNPs) in 865 candidate cancer genes, biasing nsSNP selection toward those predicted to be deleterious. Preeminent associations were identified in SNPs mapping to genes pivotal in the DNA damage-response and cell-cycle pathways, including ATM F858L (odds ratio [OR] = 2.28, P < .0001) and P1054R (OR = 1.68, P = .0006), CHEK2 I157T (OR = 14.83, P = .0008), BRCA2 N372H (OR = 1.45, P = .0032), and BUB1B Q349R (OR = 1.42, P = .0038). Our findings implicate variants in the ATM-BRCA2-CHEK2 DNA damage-response axis in risk of CLL.
Introduction
Chronic lymphocytic leukemia (CLL) is the most common form of leukemia and is 1 of a number of B-cell lymphoproliferative disorders (B-cell LPDs) that include Hodgkin lymphoma (HL) and non-Hodgkin lymphoma (NHL). Inherited predisposition to CLL and other B-cell LPDs is well documented, with epidemiologic studies showing that the risk of CLL in first-degree relatives of patients with CLL is elevated 7-fold.1,2 Furthermore, studies have demonstrated that familial associations exist between different types of B-cell LPDs, with risks of HL and NHL increased 2-fold in relatives of patients with CLL.2 Whereas part of the familial risk could be due to high-penetrance mutations in as-yet-unidentified genes, a polygenic model based on low-penetrance alleles provides an alternative explanation. Such a hypothesis is supported by the recent observation that monoclonal B-cell lymphocytosis with an identical phenotype to indolent CLL can be detected in a high proportion of healthy members of CLL families.3,4
Alleles conferring small relative risks are difficult, if not impossible, to identify through classic genome-wide linkage scans.5 The search for low-penetrance disease alleles has therefore centered on association studies based on comparing the frequency of polymorphic genotypes in patients and control subjects. The spectrum of mutations in Mendelian disease genes, coupled with issues of statistical power, provides a compelling rationale for the application of a sequence-based approach targeting nonsynonymous single nucleotide polymorphisms (nsSNPs) rather than reliance on a map of anonymous haplotypes.6 We sought to identify novel low-penetrance susceptibility alleles for CLL by genotyping nsSNPs across 865 genes with relevance to cancer biology, biasing selection of nsSNPs toward those likely to have deleterious consequences. Genotyping 992 patients with CLL and 2707 healthy controls from the United Kingdom population across 1467 nsSNPs provided strong evidence for association between genes in the DNA damage-response and cell-cycle pathways and risk of CLL.
Patients, materials, and methods
Patients and control subjects
Patients with adult CLL (992 total; 688 men, 304 women; mean age at diagnosis, 61 years; SD ± 11.4) ascertained through the Royal Marsden Hospital National Health Service Trust (RMHNHST) Haemato-Oncology Unit were included in the study. The RMHNHST serves as a tertiary referral center and patient selection was not significantly biased to any specific geographic region within the United Kingdom. The diagnosis of B-cell CLL in patients was established using standard clinico-pathologic and immunologic criteria in accordance with current World Health Organization classification guidelines.7 A total of 2707 healthy individuals were recruited as part of the National Cancer Research Network Trial (1999-2002), the Royal Marsden Hospital Trust/Institute of Cancer Research Family History and DNA Registry (1999-2004), or the National Study of Colorectal Cancer Genetics Trial (2004), all established within the United Kingdom. Control subjects (836 men, 1871 women; mean age, 59 years; SD ± 10.9) were the spouses of patients with nonhematologic malignancies. None of the controls had a personal history of malignancy. All patients and control subjects were white and British, and there were no obvious differences in the demography of patients and control subjects in terms of place of residence within the United Kingdom. Blood samples were obtained with informed consent and ethics review board approval in accordance with the tenets of the Declaration of Helsinki. DNA was extracted from samples using conventional methodologies and quantified using PicoGreen (Invitrogen, Carlsbad, CA).
Selection of candidate genes and SNPs
We have previously established a publicly accessible PICS (Predicted Impact of Coding SNPs) database of potentially functional nsSNPs in genes with relevance to cancer biology.8 Briefly, candidate genes were identified by interrogating the Gene Ontology Consortium database,9 Kyoto Encyclopedia of Genes and Genomes database,10 Iobion's Interaction Explorer PathwayAssist Program, National Center for Biotechnology Information (NCBI) Entrez Gene database,11 and the CancerGene database. Both keyword and gene-pathway specific queries were performed using the following categories: catalytic activity; cellular processes, growth, and death; development; enzyme regulator activity; folding, sorting, and degradation; ligand-receptor interaction; nucleotide metabolism; physiologic processes; regulation of biologic processes; replication and repair; signal transduction and signal transducer activity; transcription and transcription regulator activity; translation and translation regulator activity; and transporter activity. A total of 9537 validated nsSNPs with minor allele frequency (MAF) data were identified within 21 506 LocusLink annotated genes in NCBI dbSNP Build 123. Filtering this list and linking it to 7080 candidate cancer genes yielded 3666 validated nsSNPs with MAF of 0.01 or more in white populations. The functional impact of each nsSNP was predicted using the in silico computational tools PolyPhen12 and SIFT (version 2.1).13 Using the PICS database and published work on resequencing of DNA repair genes,14-18 we prioritized a set of 1467 nsSNPs for the current study (Figure S1, available on the Blood website; see the Supplemental Figures link at the top of the online article). Annotated flanking sequence information for each SNP was derived from the University of California Santa Cruz (UCSC) Human Genome Browser (Assembly hg17).
SNP genotyping and data manipulation
Genotyping of samples was performed using customized Illumina Sentrix Bead Arrays (Illumina, San Diego, CA) according to the manufacturer's protocols. DNA samples with GenCall (Illumina) scores lower than 0.25 at any locus were considered “no calls.” A DNA sample was deemed to have failed if it generated genotypes at fewer than 95% of loci. A SNP was deemed to have failed if fewer than 95% of DNA samples generated a genotype at the locus. Conversion of genotype data into formats suitable for processing was performed using in-house Perl scripts (available upon request). Conventional statistical manipulations were undertaken in STATA (version 8; http://www.stata.com), S-Plus (version 7; http://www.insightful.com) or R (version 2.0.0; http://www.r-project.org).
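The two call-rate thresholds described above can be illustrated with a minimal sketch. It assumes that no-calls (GenCall score below 0.25) have already been recorded as None, and that sample and SNP call rates are both computed on the full genotype matrix; the text does not state the order in which the two filters were applied. The function and variable names are hypothetical and are not taken from the in-house Perl scripts.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: sample id -> one genotype call per SNP (None = no call).
Genotypes = Dict[str, List[Optional[str]]]


def call_rate_qc(genotypes: Genotypes, min_rate: float = 0.95) -> Tuple[List[str], List[int]]:
    """Return (failed sample ids, failed SNP indices) for a call-rate threshold."""
    n_samples = len(genotypes)
    n_snps = len(next(iter(genotypes.values())))

    # A sample fails if it has genotypes at fewer than min_rate of the loci.
    failed_samples = [
        sample
        for sample, calls in genotypes.items()
        if sum(c is not None for c in calls) / n_snps < min_rate
    ]

    # A SNP fails if fewer than min_rate of the samples have a genotype at it.
    failed_snps = [
        j
        for j in range(n_snps)
        if sum(calls[j] is not None for calls in genotypes.values()) / n_samples < min_rate
    ]
    return failed_samples, failed_snps
```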
Population stratification
Genotypic frequencies in control subjects for each SNP were tested for departure from Hardy-Weinberg equilibrium (HWE) using a χ2 test or Fisher exact test where an expected cell count was less than 5. SNPs that violate the HWE in the control population can indicate selection bias or genotyping errors, and were thus removed from further analyses. To detect and control for possible population stratification, we employed the genomic control approach19 using all SNPs to estimate the stratification parameter
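As an illustration of the two checks described in this section, the sketch below computes a 1-d.f. χ2 test of HWE from genotype counts and a genomic control inflation factor. Two assumptions are made that the text does not confirm: the inflation factor is estimated as the median association statistic divided by the median of a 1-d.f. χ2 distribution (the usual genomic control estimator, conventionally written λ), and the Fisher exact fallback for small expected counts is omitted. All function names are illustrative.

```python
import math
from statistics import median
from typing import Sequence, Tuple

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 d.f.


def chi2_sf_1df(x: float) -> float:
    """Upper-tail P value of a chi-square statistic with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))


def hwe_chi2(n_aa: int, n_ab: int, n_bb: int) -> Tuple[float, float]:
    """Chi-square test of HWE for a polymorphic SNP; returns (statistic, P)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, chi2_sf_1df(stat)


def genomic_control_lambda(stats: Sequence[float]) -> float:
    """Assumed estimator: median of 1-d.f. association statistics over its null median."""
    return median(stats) / CHI2_1DF_MEDIAN
```

A value close to 1 from genomic_control_lambda would indicate little inflation of the test statistics.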
Risk of CLL associated with nsSNPs
The most efficient test of association depends on the true mode of inheritance of alleles. Since this is not known, we based our analyses on the difference between allelic frequencies in patients and control subjects using the χ2 test with 1 degree of freedom or Fisher exact test if an expected cell count was less than 5. We denoted this test statistic TA with the corresponding P value PA. We also investigated 2 further tests based on 2 × 2 tables combining the heterozygotes with either the common or rare homozygotes to derive the statistics TR and TD with corresponding P values PR and PD, which are most powerful under recessive or dominant models, respectively. The risks associated with each SNP were estimated by allelic, dominant, and recessive odds ratios (ORs) using unconditional logistic regression, and associated 95% CIs were calculated in each case. Where it was not possible to calculate ORs and their CIs by asymptotic methods, an exact approach was implemented using LogXact software (Cytel, Cambridge, MA).
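The three 2 × 2 tables behind TA, TD, and TR can be sketched as follows. The sketch uses the cross-product OR with a Wald (Woolf) interval, which is what unconditional logistic regression with a single binary covariate yields for a 2 × 2 table. Treating the allelic table this way is an assumption, as the text does not say how the allelic OR was coded. The Fisher exact and LogXact fallbacks are not reproduced, and the function names are illustrative.

```python
import math
from typing import Dict, Tuple

Table = Tuple[int, int, int, int]  # (cases exposed, cases unexposed, controls exposed, controls unexposed)


def build_tables(case: Tuple[int, int, int], control: Tuple[int, int, int]) -> Dict[str, Table]:
    """Genotype counts are (common homozygote, heterozygote, rare homozygote)."""
    (c0, c1, c2), (k0, k1, k2) = case, control
    return {
        "allelic": (c1 + 2 * c2, 2 * c0 + c1, k1 + 2 * k2, 2 * k0 + k1),  # rare vs common alleles
        "dominant": (c1 + c2, c0, k1 + k2, k0),  # heterozygotes pooled with rare homozygotes
        "recessive": (c2, c0 + c1, k2, k0 + k1),  # heterozygotes pooled with common homozygotes
    }


def chi2_2x2(t: Table) -> Tuple[float, float]:
    """Pearson chi-square (1 d.f., no continuity correction); returns (statistic, P)."""
    a, b, c, d = t
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, math.erfc(math.sqrt(stat / 2.0))


def odds_ratio_ci(t: Table, z: float = 1.96) -> Tuple[float, float, float]:
    """Cross-product OR with Wald 95% CI; requires all four cells to be nonzero."""
    a, b, c, d = t
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds_ratio, odds_ratio * math.exp(-z * se_log_or), odds_ratio * math.exp(z * se_log_or)
```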
Multiple testing
Correction for multiple testing in association studies using simple adjustment approaches such as the Bonferroni correction is known to be conservative due to the assumption of independence between tests, which can lead to type II errors. To control the type I error rate, we adopted an empirical Monte Carlo simulation approach20 based on 10 000 permutations, which takes into account the fact that tests may be correlated due to the presence of linkage disequilibrium (LD) throughout the genome. At each iteration, patient and control subject labels are permuted at random and maximum test statistics TAmax, TDmax, and TRmax are determined. For each of these statistics (allelic, dominant, or recessive models), significance levels of the observed statistics from the original data are then estimated by the proportion of permutation samples with TAmax, TDmax, and TRmax larger than those in the observed data. Although this approach adjusts for multiple testing for each of the 3 statistics separately, the consequent increase in false-positive rate is expected to be small due to the strong dependence between tests.
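The permutation scheme can be illustrated for the allelic statistic with a minimal sketch. Several details are assumptions: genotypes are coded as rare-allele counts (0, 1, 2), ties between the permuted maximum and the observed statistic are counted as exceedances (slightly more conservative than "larger than"), and the default number of permutations is reduced for speed. The names allelic_stat and max_t_adjusted_p are hypothetical.

```python
import random
from typing import List, Sequence


def allelic_stat(geno: Sequence[int], labels: Sequence[int]) -> float:
    """Allelic chi-square (1 d.f.); geno = rare-allele counts, labels = 1 for patients, 0 for controls."""
    n_case = sum(labels)
    a = sum(g for g, y in zip(geno, labels) if y == 1)  # rare alleles in patients
    c = sum(g for g, y in zip(geno, labels) if y == 0)  # rare alleles in controls
    b = 2 * n_case - a                                   # common alleles in patients
    d = 2 * (len(labels) - n_case) - c                   # common alleles in controls
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # monomorphic SNP or empty group
        return 0.0
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom


def max_t_adjusted_p(snps: List[Sequence[int]], labels: Sequence[int],
                     n_perm: int = 1000, seed: int = 0) -> List[float]:
    """Adjusted P per SNP: share of permutations whose maximum statistic reaches the observed one."""
    observed = [allelic_stat(col, labels) for col in snps]
    rng = random.Random(seed)
    permuted = list(labels)
    exceed = [0] * len(snps)
    for _ in range(n_perm):
        rng.shuffle(permuted)  # break the link between genotype and disease status
        t_max = max(allelic_stat(col, permuted) for col in snps)
        for j, t_obs in enumerate(observed):
            if t_max >= t_obs:
                exceed[j] += 1
    return [e / n_perm for e in exceed]
```

Because every SNP is compared against the same permuted maximum, correlation between SNPs due to LD is carried through the permutations rather than assumed away.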
Assessment of linkage disequilibrium between SNPs
To assess the level of LD between SNPs, we calculated the pairwise LD measure D′ between consecutive pairs of markers throughout the genome using the expectation-maximization algorithm to estimate 2-locus haplotype frequencies. We chose to use the measure D′ as it is less sensitive to small minor allele frequencies than other measures such as r2. This information was used to investigate the relationship between haplotypes and disease status. Specifically, haplotypes were reconstructed using a Markov chain Monte Carlo method, and their frequencies in patient and control samples compared by permutation testing, using the PHASE program (http://www.stat.washington.edu/stephens/software.html).21,22
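A minimal sketch of the pairwise D′ calculation follows, using an expectation-maximization step to resolve the phase of double heterozygotes. It assumes unphased genotypes summarized as a 3 × 3 table of counts for 2 biallelic loci (alleles A/a and B/b) and a fixed number of iterations; it reports the absolute value of D′. It does not reproduce the haplotype reconstruction performed with PHASE, and the function names are illustrative.

```python
from typing import Dict, List


def em_haplotype_freqs(n: List[List[int]], n_iter: int = 200) -> Dict[str, float]:
    """n[g1][g2] = individuals carrying g1 copies of allele A and g2 copies of allele B."""
    total = 2 * sum(sum(row) for row in n)
    # Haplotype counts that are unambiguous given the genotypes.
    base = {
        "AB": 2 * n[2][2] + n[2][1] + n[1][2],
        "Ab": 2 * n[2][0] + n[2][1] + n[1][0],
        "aB": 2 * n[0][2] + n[1][2] + n[0][1],
        "ab": 2 * n[0][0] + n[1][0] + n[0][1],
    }
    double_het = n[1][1]  # phase unknown: AB/ab or Ab/aB
    p = {h: 0.25 for h in base}
    for _ in range(n_iter):
        cis = p["AB"] * p["ab"]
        trans = p["Ab"] * p["aB"]
        # E-step: expected share of double heterozygotes with phase AB/ab.
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: update haplotype frequencies from expected counts.
        p = {
            "AB": (base["AB"] + double_het * w) / total,
            "Ab": (base["Ab"] + double_het * (1 - w)) / total,
            "aB": (base["aB"] + double_het * (1 - w)) / total,
            "ab": (base["ab"] + double_het * w) / total,
        }
    return p


def d_prime(p: Dict[str, float]) -> float:
    """Absolute D' from 2-locus haplotype frequencies."""
    p_a = p["AB"] + p["Ab"]
    p_b = p["AB"] + p["aB"]
    d = p["AB"] - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else 0.0
```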
Covariates and interactions
Information on a number of covariates was available for the patients, including sex, family history of CLL, and age at diagnosis. The test statistic TA was computed for all subgroups, together with ORs and their associated 95% CIs. Under certain conditions, a 2-stage process incorporating estimates of pairwise interactions between significant SNPs can yield greater power to detect association.23 To investigate epistatic interactions, each pair of SNPs that displayed a significant allelic association at the 5% level was evaluated by fitting a saturated logistic regression model and the log likelihood ratio statistic for comparison with the main effects model computed. This was compared against a χ2 distribution with 1 degree of freedom (d.f.). Statistics were then adjusted for multiple testing using a Bonferroni correction.
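The comparison of models described above can be written out as follows. This is a sketch under an assumption: each SNP enters the model as a single coded covariate, x_1 or x_2 (for example, carrier status for the rare allele), which is what a 1-d.f. comparison implies. The coding and the symbols β, ℓ, and Λ are illustrative and are not taken from the study.

```latex
\begin{aligned}
\text{main effects:}\quad & \operatorname{logit} P(\text{CLL}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \\
\text{with interaction:}\quad & \operatorname{logit} P(\text{CLL}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 \\
\text{test statistic:}\quad & \Lambda = 2\,\bigl(\ell_{\text{int}} - \ell_{\text{main}}\bigr) \sim \chi^2_1
\quad \text{under } H_0\colon \beta_3 = 0
\end{aligned}
```

Here ℓ denotes the maximized log likelihood of each model.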
Results
Data quality and genotyping success
Of the 3699 DNA samples submitted for genotyping, a total of 3657 samples were successfully processed, generating in excess of 4 million genotypes. Genotypes were obtained for 962 (97.0%) of 992 patients and 2695 (99.6%) of 2707 control subjects. The likelihood of a DNA sample failing to genotype correlated with sample DNA concentration. SNP call rates per sample for each of the 3657 DNA samples were greater than 99.6% in patients and control subjects. Of the 1467 SNPs submitted for analysis, 1218 SNPs were satisfactorily genotyped (83%), with mean individual sample call rates of 99.7% and 99.8% in patients and control subjects, respectively. Of the 1218 SNP loci satisfactorily genotyped, 188 were fixed in all samples, leaving 1030 SNPs for which genotype data were informative.
Population stratification
Of the 1030 polymorphic nsSNPs, 55 were found to violate HWE in controls at the 5% significance level (expected number of failures, 52). After Bonferroni correction, 6 SNPs still violated HWE and were removed, leaving a total of 1024 for further analysis. Each of the 6 SNPs removed had low genotyping reliability scores. Table S1 details all MAF data in 2695 controls for each of the 1024 nsSNPs. Of the remaining SNPs that violated HWE at the nominal 5% level, none was associated (P < .05) with risk of CLL. Implementing the genomic control method indicated no evidence of population stratification in our data as a cause of false-positive results, as the 95% confidence interval for the stratification parameter
SNPs and risk of CLL
Statistically significant associations were identified for 49 of 1024 SNPs at the 5% level by means of the TA statistic, 3 of which were significant at the 0.1% level (Table 1). The test statistics TD and TR and ORs under dominant and recessive models were computed for 1024 and 886 SNPs with sufficient MAF, respectively.
Of the 49 SNPs showing significant association (PA ≤ .05), 2 SNPs have previously been documented to be functional: I157T in CHK2 checkpoint yeast homolog (CHEK2 [MIM 604373]), a cell-cycle checkpoint regulator, and P1054R in ataxia telangiectasia mutated (ATM [MIM 607585]), a cell-cycle checkpoint kinase required for cellular response to DNA damage. In addition, 1 SNP encodes a termination codon, S474X in lipoprotein lipase (LPL [MIM 238600]), and a further 31 SNPs are predicted by at least 1 in silico algorithm to be deleterious (Table 2).
ATM SNPs F858L (rs1800056) and P1054R (rs1800057), which are in strong LD, showed the most significant allelic association with CLL, with the strongest association under a dominant model (F858L: ORD = 2.28; 95% CI, 1.53-3.40; PD < .0001; P1054R: ORD = 1.68; 95% CI, 1.25-2.28; PD = .0006). After permutation analysis to adjust for multiple testing, ATM F858L (rs1800056) remained significantly associated with CLL risk, with adjusted P = .03 at the genome-wide level. Additionally, the haplotype formed by the minor alleles of ATM F858L and P1054R was significantly overrepresented in patients compared with control subjects (ORD = 2.32; 95% CI, 1.56-3.45; PD < .0001, P = .01 after permutation testing).
Nine additional SNPs located within the DNA damage-response axis also showed significant association (Figure 1; Table 1). Pre-eminent SNPs on the basis of biologic relevance were I157T (rs17879961) in CHEK2 (ORD = 14.83; 95% CI, 1.85-∞; PD = .0008), N372H (rs144848) in breast cancer 2 early onset (BRCA2 [MIM 600185]), a tumor suppressor involved in DNA double-strand break repair (ORR = 1.45; 95% CI, 1.13-1.86; PR = .0032), and Q349R (rs1801376) in BUB1 budding uninhibited by benzimidazoles yeast homolog 1 (BUB1B [MIM 602860]), encoding a kinase involved in spindle checkpoint function (ORR = 1.42; 95% CI, 1.12-1.81; PR = .0038).
Stratification of patients by sex, family history of the disease, and age at diagnosis (≤ 60 years, > 60 years) did not significantly affect study findings. We examined for interactive effects between the 49 SNPs significantly associated with risk of CLL (PA < .05) by fitting full logistic regression models for each pair, generating 1176 models, and comparing these with the main effects model. The strongest interaction was between BRCA2 N372H and EPH receptor A7 (EPHA7 [MIM 602190]) P278S (P = .0007), albeit nonsignificant after correction for multiple testing.
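The size of the pairwise search, and why P = .0007 does not survive correction, can be checked with a few lines of arithmetic. The sketch assumes that the Bonferroni adjustment is taken over all 1176 pairwise models; the variable names are illustrative.

```python
import math

n_snps = 49
n_pairs = math.comb(n_snps, 2)           # number of unordered SNP pairs

alpha = 0.05
bonferroni_threshold = alpha / n_pairs   # per-test significance threshold

p_observed = 0.0007                      # BRCA2 N372H x EPHA7 P278S interaction
p_adjusted = min(1.0, p_observed * n_pairs)

print(n_pairs, bonferroni_threshold, p_adjusted)
```

Under this assumption the per-test threshold is about 4 × 10^-5, well below the observed P value.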
Discussion
We evaluated nsSNPs on the basis that each has the capacity to directly affect the function of expressed proteins, implying a higher probability of being directly causally related to susceptibility. Allelic loss in cells used in genetic analyses is a potential source of bias, because an apparent increase in homozygosity may be due to loss of heterozygosity in tumor leukocytes. There was no evidence of such confounding in our study as a source of spurious results, since the number of SNPs showing deviation from Hardy-Weinberg equilibrium followed the expected distribution, and associations were primarily based upon an overrepresentation of heterozygotes.
For 2 of the nsSNPs identified, CHEK2 I157T and ATM P1054R, there is evidence they are likely to directly affect the risk of malignancy. Furthermore, for an additional 32 of the SNPs significantly associated with CLL risk, the substitution either resulted in a termination codon or was predicted to be functionally deleterious using the in silico algorithms PolyPhen and/or SIFT. Although predictions about the functional consequences of amino acid changes are not definitive, these algorithms have been demonstrated in benchmarking studies to successfully categorize 80% of amino-acid substitutions.24
Through interrogation of the Pathway Assist program (Stratagene, La Jolla, CA), 11 of the 49 associated SNPs were found within genes encoding pivotal components of the ATM-BRCA2-CHEK2 DNA damage-response and cell-signaling pathways.
The 3 SNPs in ATM associated with increased risk of CLL, F582L, F858L, and P1054R, are each predicted to be deleterious. Heterozygosity for P1054R has been reported to be associated with decreased ATM expression in CLL;25 furthermore, cell lines from breast cancer patients harboring the linked heterozygous F858L and P1054R variants exhibited increased radiosensitivity.26 ATM 1054R has previously been associated with an increased risk of breast27,28 and prostate cancer.28 While the functional significance of F582L is unknown, this SNP has previously been reported to confer an elevated risk of acute lymphocytic leukemia.29
We have recently conducted a genome-wide linkage search of 115 families segregating CLL and other related B-cell LPDs but did not demonstrate significant linkage to ATM (P = .08).30 This observation does not contradict our current finding of an overrepresentation of the minor alleles of ATM F858L and P1054R in patients with CLL, as the impact of ATM on the familial risk of CLL generated by both variants is approximately 1.03, insufficient to generate a significant departure in expected allele-sharing probabilities between affected individuals in the 115 families.
ATM is critical for regulation of cell-cycle checkpoints, and activation of ATM by DNA damage leads to ATM-dependent phosphorylation of CHEK2.31 CHEK2 I157T is localized in a functionally important domain of CHEK2, and the variant protein has been shown to be defective in its ability to bind TP5332 and BRCA1.33 Previously, CHEK2 I157T has been associated with increased risk of breast, colon, kidney, and prostate cancers.34 Furthermore, possession of 157T has been shown to confer a 2-fold increase in risk of NHL,34 supporting the role of inherited dysregulation of CHEK2 in the development of B-cell LPDs.
BRCA2 is involved in the monitoring and repair of DNA double-strand breaks.35 The minor allele of N372H has been documented to confer an elevated risk of breast36 and ovarian cancers.37 N372H is located between residues 290 and 453 of BRCA2, a region shown to interact with the transcriptional coactivator P/CAF,38 and hence has the potential to directly modify BRCA2-mediated regulation of transcription.
An additional 6 nsSNPs were identified in genes that interact either directly or indirectly with the ATM-BRCA2-CHEK2 DNA damage-response axis. These include SNPs D784V in EGF, I253M in insulin-like growth factor–binding protein 1 (IGFBP1 [MIM 146730]), and R574P in matrix metallopeptidase 9 (MMP9 [MIM 120361]), which are involved in Sp1-mediated down-regulation of ATM transcription by EGF.39 Despite the low minor-allele frequency of the SNPs individually associated with risk of CLL in our study, there was some evidence for an interaction between BRCA2 N372H and EPHA7 P278S (P = .0007), albeit nonsignificant after correction for multiple testing.
The prior probability of identifying a significant association with CLL risk for a series of SNPs mapping to a single gene pathway is intuitively small. Genotyping a total of 81 SNPs across 50 genes (including ATM, BRCA2, and CHEK2) implicated in the cell-cycle pathway via Gene Ontology Consortium annotations identified 8 SNPs displaying statistical association with risk of CLL, a significantly greater number than expected a priori (P < .05). By contrast, no significant associations were observed for SNPs mapping to genes encoding components of the cell-cell signaling (21 SNPs, 18 genes) and cell differentiation (20 SNPs, 17 genes) pathways.
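One simple way to gauge how unusual 8 nominally significant SNPs out of 81 would be is a binomial tail probability. The text does not state which test produced the reported P value, so this is only an illustrative calculation. It assumes 81 independent tests, each with a 5% chance of nominal significance under the null hypothesis, an assumption that LD between SNPs would weaken. The function name is hypothetical.

```python
import math


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_tests = 81
alpha = 0.05
expected_hits = n_tests * alpha                 # about 4 SNPs expected by chance
tail = binomial_upper_tail(8, n_tests, alpha)   # chance of 8 or more by chance alone
print(expected_hits, tail)
```

Under these assumptions the tail probability lies close to the .05 threshold.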
Several lines of evidence support a role for inherited dysfunction in the ATM-CHEK2-BRCA2 axis as a cause of predisposition to CLL. Recessive ataxia telangiectasia (A-T), caused by mutations in ATM, is well established to confer a substantive increase in risk of LPD,40 and an overrepresentation of LPD has been documented in relatives of patients with A-T.41 Mutations in ATM, CHEK2, and BRCA2 are documented to confer an increased risk of breast cancer. This fact, coupled with the elevated risk of LPD reported in relatives of patients with breast cancer,1 suggests that a subset of breast cancers and LPDs have a common biology.
Our study provides evidence that inherited predisposition to CLL is in part mediated through low-penetrance alleles, specifically variants in the ATM-BRCA2-CHEK2 DNA damage-response axis. It is, however, desirable to validate these findings through analysis of additional large datasets.
Prepublished online as Blood First Edition Paper, March 30, 2006; DOI 10.1182/blood-2005-12-5022.
Supported by Leukaemia Research, Cancer Research UK, the Arbib Foundation, National Cancer Research Network, and the European Union (CCPRB).
M.F.R. and G.S.S. contributed equally to this study.
An Inside Blood analysis of this article appears at the front of this issue.
The publication costs of this article were defrayed in part by page charge payment. Therefore, and solely to indicate this fact, this article is hereby marked “advertisement” in accordance with 18 U.S.C. section 1734.
We gratefully acknowledge the participation of all patients with CLL and control individuals. The authors are indebted to Ruth Allinson, Richard Coleman, Christina Fleischmann, Nicholas Hearle, Athena Matakidiou, Mobshra Qureshi, Hayley Spendlove, and Remben Talaban for sample ascertainment.