Abstract
Chronic lymphocytic leukemia (CLL) and other B-cell lymphoproliferative disorders display familial aggregation. To identify a susceptibility gene for CLL, we assembled families from the major European (ICLLC) and American (GEC) consortia to conduct a genome-wide linkage analysis of 101 new CLL pedigrees using a high-density single nucleotide polymorphism (SNP) array and combined the results with data from our previously reported analysis of 105 families. Here, we report on the combined analysis of the 206 families. Multipoint linkage analyses were undertaken using both nonparametric (model-free) and parametric (model-based) methods. After the removal of high linkage disequilibrium SNPs, we obtained a maximum nonparametric linkage (NPL) score of 3.02 (P = .001) on chromosome 2q21.2. The same genomic position also yielded the highest multipoint heterogeneity LOD (HLOD) score under a common recessive model of disease susceptibility (HLOD = 3.11; P = 7.7 × 10⁻⁵), which was significant at the genome-wide level. In addition, 2 other chromosomal positions, 6p22.1 (corresponding to the major histocompatibility locus) and 18q21.1, displayed HLOD scores higher than 2.1 (P < .002). None of the regions coincided with areas of common chromosomal abnormalities frequently observed in CLL. These findings provide direct evidence for Mendelian predisposition to CLL and evidence for the location of disease loci.
Introduction
B-cell chronic lymphocytic leukemia (CLL [MIM no. 151400]) accounts for approximately 25% of all leukemias and is the most common form of lymphoid malignancy in Western countries.1 Family2-4 and epidemiologic5-9 studies provide strong support for the familial aggregation of CLL and other related B-cell lymphoproliferative disorders (LPDs) such as non-Hodgkin lymphoma (NHL [MIM no. 605027]) and Hodgkin lymphoma (HL [MIM no. 236000]).
The striking multiple-case families reported in the literature provide substantive evidence for an inherited predisposition to CLL2-4 and suggest the existence of susceptibility alleles with pleiotropic effects.2,10 Case-control and cohort studies that have systematically estimated the familial risk of CLL and other LPDs have shown that most B-cell LPDs display site-specific elevated familial risks,5-9 but particularly CLL, where risks are increased 3- to 7-fold in first-degree relatives of cases. Furthermore, such studies have demonstrated that familial associations exist between the different types of B-cell LPDs with risks of NHL and HL showing 2-fold increases in relatives of CLL cases.
These observations provide a strong rationale for searching for predisposition genes for CLL through linkage searches of multiple-case families. Two genome-wide linkage scans have been conducted to date. The first reported by Goldin et al11 in 2003 used 359 microsatellite markers to genotype 18 CLL families. In 2005, a second genome-wide scan of 105 families segregating CLL with or without additional B-cell LPD cases was conducted using the Affymetrix Mapping 10Kv131 array, which contained approximately 11 500 single nucleotide polymorphisms (SNPs).12 In both studies, analyses provided evidence for susceptibility at a number of loci, but none achieved statistical significance, suggesting that a much larger familial sample was required to identify CLL predisposition loci.
To address this, we have undertaken a further genome-wide linkage scan of an additional 101 families ascertained through the International CLL Consortium (ICLLC) and the Genetic Epidemiology of CLL (GEC) consortia. This search was conducted using high-density SNP arrays, thereby allowing us to pool findings with data generated from our previous scan of 105 families and, in so doing, to create a dataset of 206 families, representing the majority of CLL families identified worldwide. Here, we report further evidence for a Mendelian predisposition to CLL and strong evidence for the location of novel disease loci.
Patients, materials, and methods
Ascertainment and collection of families
For clarity, we refer to our previously reported genome-wide scan of 105 pedigrees12 as phase 1 and the current analysis of 101 pedigrees as phase 2. As for those in phase 1, phase 2 pedigrees consisted of families with B-cell CLL with or without the segregation of additional B-cell LPD cases. These families were ascertained through hematologists in the United Kingdom, United States, Norway, Israel, Italy, Germany, The Netherlands, Portugal, and Australia participating in the ICLLC (51 phase 2 families) and the GEC consortia (50 phase 2 families). The diagnoses of B-cell CLL and other B-cell LPDs in affected family members were established using accepted standard clinicopathological and immunologic criteria in accordance with current WHO classification guidelines.13 Blood samples were obtained from both the offspring and spouse of deceased affected family members wherever possible to facilitate the reconstruction of genotypes. DNA was extracted from venous blood samples using conventional methodologies. Research protocols and informed consents were obtained according to each group's institutional review board (Multi-Centre Research Ethics Committee UK; National Cancer Institute; Mayo Clinic College of Medicine; Moores Cancer Center, University of California, San Diego; University of Texas M. D. Anderson Cancer Center; University “La Sapienza,” Nepean Hospital) in accordance with the Declaration of Helsinki.
Genotyping
Prior to genotyping, all DNA samples were quantified by PicoGreen (Invitrogen, Paisley, United Kingdom). A genome-wide linkage search of the 101 families in phase 2 was undertaken using the GeneChip Mapping 10K 2.0 Xba Array containing approximately 10 200 SNP markers (Affymetrix, Santa Clara, CA). SNP genotypes were obtained by following the Affymetrix protocol for the GeneChip Mapping 10K 2.0 Xba Array. Briefly, 250 ng genomic DNA isolated from peripheral blood was digested per sample with the restriction endonuclease XbaI for 2.5 hours. Digested DNA was mixed with Xba adapters and ligated using T4 DNA ligase for 2.5 hours. Ligated DNA was added to 4 separate polymerase chain reactions (PCRs), cycled, pooled, and purified to remove unincorporated ddNTPs. The purified PCR products were then fragmented and labeled with biotin-ddATP. Biotin-labeled DNA fragments were hybridized to the arrays for 18 hours in an Affymetrix 640 hybridization oven. After hybridization, arrays were washed, stained, and scanned using an Affymetrix Fluidics Station FS450 with images obtained by use of an Affymetrix GeneChip 3000 scanner. Affymetrix GCOS software (v1.4) was used to obtain raw microarray feature intensities. Feature intensities were processed using Affymetrix GTYPE (v4.0) software to derive SNP genotypes (Affymetrix).
Data manipulation and error checking
The phase 1 genome-wide linkage scan had been undertaken using the GeneChip Mapping 10Kv131 Xba array containing 11 555 SNP markers (Affymetrix). Phase 2 samples were genotyped over 10 204 markers on version 2.0 of the Affymetrix 10K array. Pooled linkage analysis of the 206 families was based upon the 10 204 SNPs common to both arrays. The pedigree relationship-testing program PREST (release 3.0)14 was used to detect pedigree errors. Non-Mendelian error checking of genotypes and generation of linkage format files from raw Affymetrix array files were performed using the program ProgenyLab (Progeny, South Bend, IN). The map order and distances between SNP markers were based on the UCSC Human Genome browser (March 2006 release). The program MERLIN15 was used to further search for and remove additional unlikely genotypes consistent with potential genotyping errors.
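The idea behind a non-Mendelian error check can be illustrated with a minimal sketch for a single father-mother-child trio at one marker. It does not reproduce ProgenyLab or MERLIN, whose internal procedures are not described here; the function name and the genotype encoding (a pair of allele labels per person) are assumptions made purely for illustration.

```python
from itertools import product

def mendelian_consistent(father, mother, child):
    """Return True if the child's genotype can be formed from one
    paternal allele and one maternal allele (hypothetical helper)."""
    target = sorted(child)
    # Try every combination of one allele from each parent.
    return any(sorted((f, m)) == target for f, m in product(father, mother))

# An AB child is compatible with AB x BB parents.
trio_ok = mendelian_consistent(("A", "B"), ("B", "B"), ("A", "B"))
# An AA child is incompatible with AA x BB parents.
trio_bad = mendelian_consistent(("A", "A"), ("B", "B"), ("A", "A"))
```

In extended pedigrees with untyped members, consistency has to be assessed jointly over the whole family, which is why dedicated pedigree software is used in practice.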
Investigation of linkage disequilibrium
Most linkage software for multipoint analyses assumes that markers are in linkage equilibrium. However, for closely spaced SNP markers this is not always the case. To identify markers in high linkage disequilibrium (LD), we calculated the pair-wise LD measure r² between consecutive pairs of SNP markers using the expectation-maximization algorithm to estimate 2-locus haplotype frequencies as previously described.12 A pair of SNPs was defined as being in high LD if they had a pair-wise LD measure of r² higher than 0.16 in accordance with criteria recently advocated.16 Linkage disequilibrium was then removed by considering each set of markers in LD (defined as sets where each consecutive marker pair in the set had r² > 0.16) and retaining one SNP from each set (the centrally positioned SNP). The impact of LD was investigated by considering linkage results calculated before and after the removal of the high-LD SNPs.
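The two steps described above (estimating pair-wise LD by expectation-maximization and thinning runs of markers in high LD) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used in the study: genotypes are assumed to be coded as 0, 1, or 2 copies of an arbitrarily chosen allele, individuals are treated as unrelated, and for sets with an even number of markers the lower of the two middle markers is retained, since the text does not say how such ties were resolved. The function names `em_r2` and `prune_high_ld` are hypothetical.

```python
def em_r2(g1, g2, max_iter=200, tol=1e-12):
    """Estimate r^2 between two biallelic SNPs from unphased genotypes.

    g1, g2: per-individual allele counts (0, 1, 2); None marks missing data.
    """
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not pairs:
        return 0.0
    alleles = {0: (0, 0), 1: (1, 0), 2: (1, 1)}
    # Haplotype frequencies keyed by (allele at SNP 1, allele at SNP 2).
    freq = {(1, 1): 0.25, (1, 0): 0.25, (0, 1): 0.25, (0, 0): 0.25}
    for _ in range(max_iter):
        counts = dict.fromkeys(freq, 0.0)
        for a, b in pairs:
            if a == 1 and b == 1:
                # Double heterozygote: phase is ambiguous, split by expectation.
                cis = freq[(1, 1)] * freq[(0, 0)]
                trans = freq[(1, 0)] * freq[(0, 1)]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                counts[(1, 1)] += w
                counts[(0, 0)] += w
                counts[(1, 0)] += 1.0 - w
                counts[(0, 1)] += 1.0 - w
            else:
                # At most one heterozygous locus: both haplotypes are known.
                for hap in zip(alleles[a], alleles[b]):
                    counts[hap] += 1.0
        total = 2.0 * len(pairs)
        new_freq = {hap: n / total for hap, n in counts.items()}
        change = max(abs(new_freq[h] - freq[h]) for h in freq)
        freq = new_freq
        if change < tol:
            break
    p1 = freq[(1, 1)] + freq[(1, 0)]
    p2 = freq[(1, 1)] + freq[(0, 1)]
    denom = p1 * (1.0 - p1) * p2 * (1.0 - p2)
    if denom <= 0:
        return 0.0  # at least one marker is monomorphic
    d = freq[(1, 1)] - p1 * p2
    return d * d / denom

def prune_high_ld(r2_consecutive, threshold=0.16):
    """Given r^2 for each consecutive marker pair, return the indices of the
    markers retained: the central marker of every run in high LD."""
    n = len(r2_consecutive) + 1
    keep, start = [], 0
    for i in range(1, n + 1):
        if i == n or r2_consecutive[i - 1] <= threshold:
            # Markers start..i-1 form one set; keep its (lower) middle marker.
            keep.append(start + (i - 1 - start) // 2)
            start = i
    return keep

# Five markers: the first three form one high-LD set, the last two another.
kept = prune_high_ld([0.5, 0.3, 0.01, 0.2])
```

Markers that are not in high LD with either neighbor form sets of size one and are always retained.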
Linkage analysis
Multipoint linkage analysis was conducted by implementation of the Perl script SNPLINK,17 which performs fully automated nonparametric (mode-of-inheritance free) and parametric analyses before and after LD removal using the program ALLEGRO (v1.2).18 Although primary statistical analyses were based on NPL scores, parametric linkage in the presence of heterogeneity was assessed using heterogeneity LOD (HLOD) scores and their accompanying estimates of the proportion of linked families (α). These analyses require the specification of a disease-transmission model. We derived LOD scores under both dominant and recessive models of inheritance with reduced penetrance and 2 age categories dependent upon age at diagnosis (<65 and 65+ years). In the absence of a genetic model, we adopted a pragmatic approach to this analysis, choosing values that were consistent with the population age-specific risks of CLL and compatible with the range of familial risks. The lifetime risk (defined at age 84 years) for being diagnosed with CLL in the U.S. population using the SEER registry data is estimated to be approximately 0.37%.19 We assumed an allele frequency of either 0.005 or 0.05 under the dominant models, and of either 0.05 or 0.20 under the recessive models. To satisfy the constraints of the lifetime risk and familial relative risks, for the dominant models the penetrances of the rare and common alleles were assumed to be 4.2% and 2.8%, respectively, for individuals aged younger than 65 years and 9.0% and 6.0%, respectively, for those older than 64 years. For the recessive models, the penetrances of the rare and common alleles were assumed to be 14.0% and 7.0%, and 30.0% and 15.0%, respectively, for the 2 liability classes. To allow for phenocopies, the penetrance of the normal genotypes under all models was set to 0.14% and 0.3%, respectively, for the 2 liability classes. All unaffected individuals were considered uninformative (ie, of unknown phenotype) in the analysis.
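As an illustration of how a penetrance model of this kind relates to population risk, the sketch below computes the risk implied by one allele frequency and one pair of penetrances under Hardy-Weinberg proportions. How the lifetime-risk and familial-risk constraints were actually combined is not described above, so comparing the older liability class with the lifetime risk is an assumption made only for this example, and the function name `population_risk` is hypothetical.

```python
def population_risk(q, pen_susceptible, pen_normal, mode):
    """Population risk implied by a single-locus model under Hardy-Weinberg
    proportions. q is the frequency of the susceptibility allele."""
    if mode == "dominant":
        susceptible = 1.0 - (1.0 - q) ** 2   # carriers of at least one copy
    elif mode == "recessive":
        susceptible = q ** 2                 # homozygotes only
    else:
        raise ValueError("mode must be 'dominant' or 'recessive'")
    return susceptible * pen_susceptible + (1.0 - susceptible) * pen_normal

# Rare-allele models, older liability class (values taken from the text).
risk_dom_rare = population_risk(0.005, 0.090, 0.003, "dominant")   # about 0.39%
risk_rec_rare = population_risk(0.05, 0.300, 0.003, "recessive")   # about 0.37%
```

Under these assumptions most of the implied population risk comes from the phenocopy term, because susceptible genotypes are rare in both models.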
Heterogeneity LOD scores follow a complex statistical distribution, which can be approximated by the maximum of 2 independently distributed χ² variables. To obtain significance estimates for HLODs, these were first converted to a χ² statistic, where χ² = 2 × ln(10) × HLOD, and significance values (P1) were then derived using the χ² distribution with one degree of freedom. The nominal P value for the HLOD score is then given by 0.5 × [1 − (1 − P1)(1 − P1)].20
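The conversion described above can be written out as a short sketch. The text does not state whether P1 is taken as the full upper tail of the χ² distribution or as half of it (a one-sided value), so the sketch exposes this choice as a flag; the flag, its default, and the function names are assumptions, and no claim is made that either setting reproduces the P values reported in this article.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def hlod_p_value(hlod, one_sided_p1=False):
    """Nominal P value for an HLOD score using the max-of-two-chi-squares
    approximation: P = 0.5 * [1 - (1 - P1)^2]."""
    chi2 = 2.0 * math.log(10.0) * hlod
    p1 = chi2_sf_1df(chi2)
    if one_sided_p1:
        p1 *= 0.5  # assumption: halve the tail for a one-sided test
    # 1 - (1 - p1)^2 is expanded to p1 * (2 - p1) to avoid cancellation.
    return 0.5 * p1 * (2.0 - p1)

p_full_tail = hlod_p_value(3.11)
p_half_tail = hlod_p_value(3.11, one_sided_p1=True)
```

For small P1 the expression is close to P1 itself, so the choice of tail convention changes the result by roughly a factor of 2.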
Results are reported in terms of an NPL statistic and its associated one-sided P value. Under the null hypothesis of no linkage, the NPL statistic is distributed asymptotically as a standard normal random variable. An estimate of the information content (IC) for each chromosome before and after high LD SNP removal was determined by use of marker set entropy information derived by MERLIN.21
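Because the NPL statistic is treated as asymptotically standard normal under the null hypothesis, its one-sided P value is the upper-tail probability of the normal distribution. The following minimal sketch shows this calculation; the function name is hypothetical, and linkage software may instead report P values based on exact or simulated distributions.

```python
import math

def npl_one_sided_p(z):
    """Upper-tail probability P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

p_npl = npl_one_sided_p(3.02)  # about 0.0013
```

An NPL score of 3.02 therefore corresponds to a nominal one-sided P value of approximately .001 under the normal approximation.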
Results
Description of families analyzed
The 206 families included in phase 1 (n = 105) and phase 2 (n = 101) comprised 155 CLL families and 51 families segregating CLL and other B-cell LPDs (Table 1). Within the 206 families, there were 487 individuals affected with CLL and 63 individuals affected with NHL or HL. A higher proportion of families in phase 2 were multigenerational compared with those in phase 1 (Table 1). The difference in composition of families between the 2 phases is not a consequence of predefined criteria for ascertainment of families but in part reflects the ongoing development of ICLLC and GEC. Overall, 42% of the 206 families contained 3 or more affected individuals.
The median age at diagnosis of CLL in the 206 families was 60 years, significantly less than the median value of 72 years for age at diagnosis observed in the general white population.19 Minimum age at diagnosis within a family is likely to be a superior indicator of the potential for existence of a susceptibility gene, since it is not influenced by older sporadic cases. In our families, the minimum age of diagnosis within the families ranged from 28 years to 81 years with a median value of 56 years.
Within phase 1, 203 (85%) of 238 family members affected with CLL were genotyped together with 17 (77%) of 22 of those affected with LPD and 3 unaffected individuals. In phase 2 families, 101 (41%) of 249 individuals affected with CLL and 22 (54%) of 41 of those affected with LPD were genotyped. In addition, 51 unaffected family members were typed, primarily to reconstruct genotypes of unavailable affected family members.
Data quality
In addition to the 223 Affymetrix 10K131 arrays run and used in the phase 1 analyses, a total of 171 Affymetrix 10Kv2.0 arrays were processed in phase 2. A number of parameters were used throughout the study to determine data quality, and all genotypes were housed within the pedigree storage program ProgenyLab. The average SNP call rate per array for phase 2 was 98.0% compared with 92.8% for phase 1. For DNA extracted from males, it was possible to examine the 309 markers on the X chromosome for errors due to miscalls or PCR contamination. No SNPs were heterozygous in male samples. Two hundred seventy-three markers were fixed or were without a single map location, leaving 9933 usable SNPs (97.3%), of which 9690 mapped to autosomes. After LD removal 7495 (77.4%) of 9690 markers remained. Less than 0.4% of the total SNP genotypes generated were considered unlikely by ProgenyLab and/or MERLIN. All such genotypes were removed from further analyses.
Linkage analysis
The IC derived for the phase 1 analyses from using only the 10 204 SNPs contained within the Affymetrix 10Kv2.0 array was not significantly different from that obtained using all original 11 555 markers on the 10Kv131 array. It is known that the presence of LD between markers can inflate multipoint linkage statistics if the vectors of inheritance have to be inferred on the basis of allele frequencies22,23 and where founders of many of the pedigrees are not available to genotype.
Multipoint nonparametric linkage analysis of all 206 families with and without the high-LD SNPs is shown in Figure 1. The panels within Figure 1 show that inclusion of high-LD SNPs in the analysis can lead to inflated linkage statistics; however, in most cases, the overall profile of the linkage statistics remains the same. Genome-wide mean IC scores were virtually identical with and without inclusion of high-LD SNPs in phase 1, phase 2, and the combined dataset (combined dataset: 0.645 before and 0.632 after LD removal).
Table 2 details the maximal NPL scores attained after removal of LD for all autosomes in phase 1, phase 2, and in the combined dataset. The best evidence for linkage was confined to 2q21.2, 5q23.2, 6p22.1, 11q12.1, and 18q21.1. Figure 2 shows transformed multipoint HLOD scores (−log₁₀[P value]) generated using the most parsimonious dominant and recessive models and corresponding transformed multipoint NPL scores for these 5 chromosomes. The maximum NPL score obtained was 3.02 with a corresponding nominal P value of .001 at map position 2q21.2 (Figure 1). At the same position, a genome-wide significant HLOD of 3.11 (P = 7.7 × 10⁻⁵) under a common recessive model was obtained with 68% of families showing evidence of linkage. Support for the 2q21.2 locus was provided by both phase 1 (NPL = 1.64, HLOD = 1.26) and phase 2 (NPL = 2.60, HLOD = 1.75) data. In addition to chromosome 2, the 4 regions on chromosomes 5q23.2, 6p22.1, 11q12.1, and 18q21.1 attained significance levels compatible with thresholds recommended for genome-wide suggestive linkage24 (Tables 2,3; Figure 2). For each of the regions there was limited evidence that linkage was primarily generated by any specific families.
For chromosome 6, the best-fitting model was attained imposing a common recessive allele with 72% of families being linked, with support coming from both phases (Table 2). HLODs for phase 1 and phase 2 were 1.35 and 1.22, respectively. For chromosomes 5 and 11, the best-fitting model was attained imposing a rare recessive allele with 85% and 82% of families being linked, respectively (Tables 2,3). Support for chromosome 5 linkage was not biased to either phase 1 or phase 2, but the region at which maximal linkage was attained was inconsistent. Similarly for chromosome 11, the majority of the support for linkage came from phase 1 data (NPL = 2.66, P = .004), and maximal linkage was obtained at different chromosomal locations (Table 2). In contrast, for chromosome 18q21.1, the best-fitting model was attained imposing a rare dominant allele with 68% of families being linked, with most of the evidence coming from phase 2 data (NPL = 2.81, P = .003).
Discussion
Following publication of 2 previous linkage studies that failed to identify significant linkage, we combined the extant families from diverse institutions worldwide, and 2 existing consortia, to generate the largest collection of familial CLL to date.
Our results provide evidence for a major susceptibility locus on chromosome 2 influencing the risk of CLL—with characteristics consistent with an autosomal recessive model of inheritance. We did not find any significant evidence for linkage in the combined dataset to any of the regions of the genome commonly associated with cytogenetically detectable chromosomal losses (6q, 13q14, or 17p) or gains (trisomy 12) in CLL.25-27 In addition to linkage to 2q21.2, we found evidence of a recessively acting locus for CLL mapping to 6p22.1 and a dominantly acting locus mapping to 18q21.1 on the basis of presumptive Mendelian models of predisposition.
Here, we have made use of data generated from high-density SNP arrays to search for CLL predisposition loci by linkage. In addition to affording maximal power to detect linkage, the output from such arrays permits pooling of data from different scans to be efficiently conducted, avoiding the serious problems of microsatellite-based searches. The combined dataset of 206 families has permitted us to robustly identify a novel locus on chromosome 2 that had displayed linkage only at the 1% level in our previous search. Furthermore, we have increased evidence for linkage to chromosome 6 in a region that includes the HLA locus. Maximal evidence of linkage to 2q21.2 and 6p22.1 under assumption of recessive transmission may, however, in part be a consequence of the high proportion of the families analyzed containing affected sibships that favor recovery of a recessive model.
Although high-density SNP arrays represent a milestone in linkage analysis, the presence of LD between SNPs can potentially inflate linkage statistics. While there is no definitive consensus on the thresholds to be used to manage the issue of LD between SNPs, we excluded SNPs with high LD, defined as those with a pair-wise linkage disequilibrium measure of r² more than 0.16. This can be viewed as conservative but is a threshold for triaging SNPs in high LD that has been recently recommended.16 When we originally reported analysis of the first 105 families,12 we imposed a less stringent criterion, advocated at the time, of r² more than 0.40.28 Given that high LD between SNP markers impacts on linkage statistics but does not result in loss of information content within our dataset, we strongly endorse imposing stringent thresholds when using high-density arrays for linkage analyses.
Although speculative at this juncture, several interesting candidate genes involved in aspects of the regulation of cellular proliferation and differentiation of B cells map to the regions of linkage on 2q21 and 18q21. The region identified on chromosome 2 includes the chemokine receptor gene (CXCR4), whose expression is higher in CLL cells and which is thought to be associated with disease progression.29 Levels of CXCR4 have also been associated with Rai stage30 and with survival in familial CLL.31 CXCR4 germ-line mutations are responsible for the warts, hypogammaglobulinemia, infections, and myelokathexis syndrome (WHIM; MIM no. 193670). The chromosome 18 region contains the SMAD7 gene (mothers against decapentaplegic, drosophila, homolog of, 7; MIM no. 602932), whose expression has been implicated in growth arrest and apoptosis of B-lineage cells and Ig class switching.32-34 It is also intriguing that we found support for involvement of the MHC region by virtue of linkage at 6p22.1. Support for a role of HLA alleles in the development of B-cell LPD is provided by the observation of linkage in sibships with Hodgkin lymphoma,35 and some previous association studies have also implicated variants within or close to the MHC class II region in susceptibility to CLL.36
Reduced expression of death-associated protein kinase 1 (DAPK1) through epigenetic silencing by promoter methylation and histone tail modification has been reported to occur in the majority of sporadic CLL cases. A rare, single-nucleotide germ-line mutation (c.1–6531A>G) upstream of DAPK1, which maps to 9q21.33, has recently been reported to segregate with CLL in a large family, suggesting that heritable predisposition to CLL may in part be mediated through germ-line variation in DAPK1.37 The contribution of inherited mutations in DAPK1 to familial risk is unclear; however, in our analyses, we found no evidence of linkage to this region of 9q21 (either in the complete dataset or in a restricted analysis based on only larger pedigrees with affection status solely defined by CLL and with 4 or more affected individuals), suggesting the contribution of this locus to the overall familial aggregation of the disease is small.
Our results suggest that more than one gene is contributing to risk of CLL in families. Such loci could be epistatic or acting independently. The observation of subclinical levels of monoclonal B-cell lymphocytosis (MBL) with an identical phenotype to indolent CLL detectable in 3% of healthy individuals but 14% of first-degree relatives in high-risk CLL families38 suggests this phenotype is a marker of genetic risk and may be an early event in the oncogenic process, consistent with a model based on epistatic interaction. As only a paucity of individuals from the 206 pedigrees have been tested for this phenotype, it was not possible to make use of MBL status in our current analysis. Future mapping studies of high-risk families incorporating data on MBL status on all available family members are therefore desirable to better characterize the model.
In conclusion, follow-up of linkage signals on 2q21, 18q21, and 6p22 is warranted along with screening of individuals for the presence of the MBL phenotype. In conjunction with conventional fine mapping of loci, as has been shown for DAPK1,37 it may be possible to also make use of expression data to identify novel disease genes. This should be possible through the ongoing collection of families from ICLLC and GEC consortia, as well as available population-based case-control collections.
An Inside Blood analysis of this article appears at the front of this issue.
The publication costs of this article were defrayed in part by page charge payment. Therefore, and solely to indicate this fact, this article is hereby marked “advertisement” in accordance with 18 USC section 1734.
Acknowledgments
Grant support for the ICLLC and work at the Institute of Cancer Research was provided by Leukemia Research, the Arbib Foundation, and Cancer Research UK. The work of the GEC is supported by grant CA118444 from the National Cancer Institute (NCI) and by the Intramural Research Program of the NIH, National Cancer Institute.
We are grateful to all patients and their families for participation in this study. We thank all the clinicians for participating in the ICLLC and the GEC consortia, specifically, in ICLLC: Drs Robin Aitchison, Petra Antunovic, Jenny Arnold, Hasan Atrah, Martin Auger, Andrew Bell, Isaac Ben-Bassat, Alain Berrebi, Lee Bond, Mary Cahill, Silvano Capalbo, John Catalano, Claire Chapman, Patricia Chipping, Patricia Clark, Rosa Collado, Clare Dearden, Helen Dignum, Ian Douglas, Julio Esteban, Savio Fernandes, Elizabeth Gaminara, Milagros Garcia Diaz, Alfonzo Garcia de Coca, Lia Ginaldi, James Hamilton, Paul Hayes, Fredrick Jackson, Steven Johnson, Maria Junior, Eric Kanfer, Daniel Kennedy, Christopher Knechtli, Anil Lakhani, Maeve Leahy, Ray Lowenthal, Arumugam Manoharan, Leonora Mehes, Sophie Mepham, Jane Merceira, Ann Miller, Alison Milne, Philippe Mineur, Godfrey Morgenstern, Anne Morrison, Richard Murrin, Ann Nandi, Anne Parker, Kanthi Perera, Klas Quabeck, Saad Rassam, Cecil Reid, Isabel Ribeiro, Colin Rist, Richard Rosenquist, Martin Rowlands, Pinhas Stark, Rhona Stewart, Robert Stockley, Paul Stross, Geoffrey Summerfield, Helen Sykes, Daniel Thompson, Christopher Tiplady, Marilyn Treacy, Virginia Tringham, Eric Van Den Neste, David Westerman, Nicholas Wickham, James Wiley, and Barrie Woodcock; and in GEC consortia, Laura Fontaine, Fatima Abbasi, Maria Sgambati, Ola Landgren, David Ng, Jorge Toro, Mary Lou McMaster, and Joseph F. Fraumeni Jr, for their work with NCI's families. We also recognize and thank Drs James Cerhan, Celine Vachon, Neil Kay, as well as Marcia Mahlman, for their work with Mayo Clinic families. Finally, we are grateful to Emily Webb for statistical advice.
Authorship
Contribution: G.S.S. designed and performed research, analyzed and interpreted data, and drafted the paper; L.R.G. designed research, collected data, contributed families, analyzed and interpreted data, and drafted the paper; R.W.W. designed and performed research; S.L.S. and R.S.H. designed research, collected data, contributed families, analyzed and interpreted data, and drafted the paper; L.R., S.S.S., F.R.M., G.E.M., S.F., M.L., T.K., M.J.K., and T.G.C. collected data and contributed families; M.J.S.D. collected data, contributed families, and drafted the paper; D.C. and N.C. designed research, collected data, contributed families, and drafted the paper.
Conflict-of-interest disclosure: The authors declare no competing financial interests.
Correspondence: Richard Houlston, Section of Cancer Genetics, Institute of Cancer Research, 15 Cotswold Road, Sutton, Surrey, SM2 5NG, United Kingdom; e-mail: richard.houlston@icr.ac.uk; Neil Caporaso, Pharmacogenetics Section, Genetic Epidemiology Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH, EPS 7116, 6120 Executive Blvd, Rockville, MD 20892; e-mail: caporasn@exchange.nih.gov.