• Common genetic variation in the HLA region modulates the risk of developing AML in patients with normal cytogenetics.

  • This genetic risk is stronger in patients with NPM1 mutations.

Abstract

Acute myeloid leukemia (AML) is the most common type of acute leukemia in adults. Genome-wide association studies have identified 4 common inherited variants associated with AML risk, but these findings have not yet been confirmed in many independent data sets. Here, we performed a replication study with 567 AML cases from the Leucegene cohort and 1865 controls from the population-based cohort CARTaGENE (CaG). Because genotypes were generated using different technologies in the 2 data sets (eg, low- vs high-coverage whole-genome sequencing), we applied stringent quality-control filters to minimize type 1 errors. We showed, using data reduction methods (eg, principal component analysis and uniform manifold approximation and projection), that our approach successfully integrated the Leucegene and CaG genetic data. We replicated the association between cytogenetically normal AML and rs3916765, a variant located near HLA-DQA2 (odds ratio, 1.96; 95% confidence interval, 1.26-3.06; P = .003). The effect size of this association was stronger when we restricted the analyses to patients with AML with NPM1 mutations (odds ratios of >2.25). We also found that rs3916765 is a whole-blood expression quantitative trait locus for HLA-DOB (P = 1.05 × 10−14) and HLA-DQA2 (P = 2.23 × 10−4). Our results confirm that a common genetic variant at the HLA locus associates with AML risk and the expression of HLA class 2 genes, providing new opportunities to improve disease prognosis and treatment.

Acute myeloid leukemia (AML) is characterized by the dysregulated clonal expansion of malignant myeloblasts in the bone marrow.1 Although it is one of the most common leukemias in adults, it represents only 1% of all cancers.1 In 2019, 1160 Canadians were diagnosed with AML.2 AML is highly lethal in older adults and very aggressive in younger individuals, with a 5-year relative survival rate of ∼32%.3 It is a genetically complex disease, with distinct AML subgroups defined by specific genetic abnormalities.4 Different classes of germ line genetic variation can influence the risk to develop AML, including monogenic predispositions, as well as more common genetic variants with individually modest impact on AML risk.4,5 Although the best-established associations with AML are monogenic in nature, recent genome-wide association studies (GWAS) have identified 4 genome-wide significant susceptibility loci for AML, including variants in or near KMT5B, IRF4, HLA-DQA2, and BICRA.1,6-8 The variant near HLA-DQA2 is associated with cytogenetically normal AML (CN-AML).6 Additional replication in independent AML case-control cohorts would reinforce the evidence that these common, modest effect variants are bona fide AML risk factors.

One challenge when working with an uncommon disease such as AML is to have a sufficiently large sample size to enable the discovery or replication of weak-to-moderate effect size genetic variants. Given a fixed budget, one approach consists of sequencing (or genotyping) only the genome of patients with AML, and using publicly available nonaffected cohorts as control samples. However, combining genetic data from cases and controls that were generated using different approaches or at a different time represents a difficult analytical task because it can result in technical batch effects that can lead to an increase type 1 error rate.9-11 

We performed a replication study of AML susceptibility GWAS loci using genetic data from an AML cohort and an external set of controls. Specifically, we used 567 patients of European ancestry from the Leucegene cohort as AML cases.12 These patients were recruited in different centers from Quebec (Canada) and have low-pass whole-genome sequence (WGS) data available (mean coverage, 3.25×) of either a tumoral sample, or both a tumoral and a normal sample. Because many of the Leucegene participants are French Canadians, a founder population,13 we selected controls from the CARTaGENE (CaG) population-based cohort. The CaG includes ∼43 000 participants recruited between the ages of 40 and 69 years from several regions of Quebec, including Greater Montreal.14 There are currently 2173 CaG participants with high-coverage WGS data available (mean coverage, >30×), including 1756 French Canadians, as well as 163 and 131 individuals with 4 grandparents born in Haiti and Morocco, respectively. We merged the Leucegene and CaG data to replicate the association between AML risk and the known (ie, GW significant) GWAS AML variants. In exploratory analyses, we also analyzed whole-blood expression quantitative trait loci (eQTL) and the association between AML and polygenic risk scores (PRSs) calculated using variants from leukemia or blood-cell traits GWAS.

WGS and whole-exome sequence (WES) of Leucegene participants

WGS data of the cohort were first published and described elsewhere.15 Exome germ line data were generated using Agilent’s SureSelect kit following manufacturer’s instruction. Libraries were sequenced on Illumina NovaSeq with paired-end 100 base pairs.

Genotype imputation using low-depth WGS data

Low-pass WGS (mean coverage of 3.25×) was available for 649 patients with AML of the Leucegene cohort (https://data.leucegene.iric.ca/). Of those, both tumoral and normal WGS was performed in 334 patients, only tumoral WGS in 311 patients, and only germ line WGS in 4 patients. Sources of tumoral cells were either blood or bone marrow for tumor samples, and sources of normal cells were either saliva or buccal swabs. Because some patients possess a pair of samples, we combined their tumoral and normal sample bam files with samtools 1.1716 before genotype imputation. We recalibrated bases with GATK (Genome Analysis Toolkit) 4.2.5.0 BaseRecalibrator and ApplyBQSR.17 As a reference panel for genotype imputation with Glimpse 1.1.1,18 we used the phased 1000 Genomes phase 3 hg38 data (available at: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased/).19 We used a Nextflow pipeline (https://github.com/CERC-Genomic-Medicine/glimpse_pipeline) that runs GATK HaplotypeCaller17 to generate genotype likelihoods at reference sites before Glimpse imputation. We removed multiallelic variants from the reference panel with bcftools 1.11,16 because Glimpse can only handle biallelic variants. From the call set, we removed monomorphic variants and kept variants with an imputation quality at R2 ≥ 0.95. To validate our Glimpse variant calls, we looked at the genotypic concordance between Glimpse calls and the Leucegene WES data set. A total of 336 Leucegene patients had their exome sequenced in 4 sequencing batches of 117, 123, 70, and 26 patients. To identify variants in WES data, we mapped reads to the genome using BWA-MEM (Burrows-Wheeler Aligner Maximum Exact Matches),20 performed a cleaning and recalibration step using GATK, and called the variants using Mutect2 (version 4.1.3.0).17 We compared each batch separately. We removed variants from the WES data set with >5% or 10% genotype missingness. We used Plink2 to measure the concordance.21,22 With the 5% genotype missingness filter, the genotype concordance was 99.15%, 98.7%, 98.0%, and 98.1% for the 4 batches, respectively, whereas it was 98.3%, 98.2%, 97.5%, and 97.1%, respectively, with the 10% genotype missingness filter. We also looked at the genotype concordance of common variants (with minor allele frequency [MAF] of >1% or >5%). For the sequencing batch with 123 patients, the genotype concordance was 97.9% for variants with a MAF of >1%, and 97.3% for variants with a MAF of >5%.

We performed standard quality-control (QC) steps of the Leucegene data set using Plink2. Using variants in linkage equilibrium, we looked at the heterozygosity of our samples, the concordance between reported and inferred sex, cryptic relatedness (kinship), and applied data reduction methods (principal component analysis [PCA] and uniform manifold approximation and projection [UMAP]) to visualize the data. We did not filter on genotype or sample missingness because there are no missing genotypes from the genotype imputation with Glimpse. We removed 5 patients because their heterozygosity was high (F coefficient estimate less than −0.4), 6 patients because the declared sex did not match the inferred sex (F coefficient estimate ≥0.7 for males and F coefficient estimate <0.7 for females), 3 patients who were another patient’s first-degree relative (kinship coefficient of >0.177), and 37 patients who were duplicated (patients who had sequencing done twice to see the progression of the tumor or because of a relapse; kinship coefficient of >0.354). For each pair of duplicates, we kept the sample that was sequenced first. For 3 of 6 patients with sex mismatch, the mismatch between reported sex and variant-based sex is explained by the loss or gain of X chromosome(s) in the tumor. Based on the karyotype of the tumor, 2 of these patients are women for whom the tumor lost 1 and 2 of its X chromosomes, and the third patient is a man with an XXY tumor karyotype. After QC, 598 patients were left for downstream analyses.

Inferring the genetic ancestry of Leucegene participants

Because self-declared ethnicity information about Leucegene patients was not collected, we inferred their genetic ancestry. We first projected Leucegene patients on the 1000 Genomes phase 3 data set (n = 2548) and used UMAP and PCA visualization to assign each patient to a 1000G superpopulation (European, African, American, East Asian, or South Asian). To confirm our assignment, we used these superpopulations as the SIRE (self-identified race/ethnicity) variable for HARE (harmonizing genetic ancestry and self-identified race/ethnicity) 23 but no Leucegene individual had their superpopulation changed. As expected, most patients are of European ancestry (94.81%, 567 patients), whereas 10 patients are of American ancestry (1.67%), 11 patients of African ancestry (1.84%), 8 patients of East Asian ancestry (1.34%), and 2 patients of South Asian ancestry (0.33%). For the projection, we extracted variants common to the 1000G and the Leucegene data sets with Plink2 and merged them with Plink1.9.22,24 We kept variants in linkage equilibrium for the calculation of the PCs.

AML tumoral subgroups classification

Tumors were classified as being CN or not and having mutations in NPM1 or not (see supplemental Table 1). The cytogenetic information about the tumors was provided by the Quebec Leukemia Cell Bank. NPM1 mutations were assessed with WES or the RNA-sequencing data for every patient.25 

WGS of CaG participants

A total of 2184 CaG participants (including 1889 participants of European ancestry) were selected for high-coverage WG DNA sequencing. The sequencing took place at the Genome Quebec Centre d’Expertise et de Services on Illumina NovaSeq sequencers using polymerase chain reaction–free libraries and a paired-end (2 × 150 bp) protocol. We used the Illumina Dragen pipeline to call and genotype single-nucleotide variants and insertions-deletions. After QC steps, 2173 individuals were kept for downstream analyses.

Integration of the Leucegene and CaG genetic data sets

The integration of the CaG and Leucegene data sets was done with Plink. We first normalized the position of the variants with the hg38 reference in the data sets separately and set the reference allele to be the one present in the hg38 reference genome. To control for technical differences between the cohorts, we applied strict filters on the individual cohorts before the merge. We kept individuals of European ancestry and variants with a MAF of ≥1%. We removed variants with an HWE (Hardy-Weinberg Equilibrium) P value <1 × 10−3 and variants for which the distance between the AF in the respective cohort and in gnomAD (Genome Aggregation Database) non-Finnish European (NFE) was superior to 0.06 for Leucegene and 0.055 for CaG (distance = |gnomAD_NFE AF − cohort AF| / √2; variants that were not found or that were monomorphic in gnomAD NFE were removed). Individuals with European ancestry in CaG were defined using self-declared ethnicity, and PCA and UMAP projections on the 1000 Genomes Project data set. Individuals with European ancestry in Leucegene were defined using PCA and UMAPs on the 1000 Genomes project data set and with HARE (above). After applying the strict filters, there were 8 274 789 variants in Leucegene (8 328 109 before filtering) and 9 319 078 variants in CaG (9 939 090 before filtering). We extracted the variants that were common to both data sets before combining the cohorts. After the merge, there were 7 658 313 variants, 567 AML cases from Leucegene and 1865 controls from CaG.

Case-control association tests

We performed replication analyses for 3 patient subgroups: (1) CN-AML (n = 240), because the association with rs3916765 was initially discovered in CN-AML cases by Lin et al6; (2) NPM1-mutated AML (n = 196); and (3) CN-NPM1-mutated AML (n = 157); the latter 2 subgroups because there is a large overlap between cases with CN-AML and patients with NPM1-mutated AML (65% of Leucegene cases with normal karyotype also have NPM1 mutations). A total of 1865 CaG participants were used as controls for each analysis. We used logistic regression in Plink2 for association testing. We used age, sex, and the 10 first PCs as covariates. The quantitative covariates were rescaled with the “covar-variance-standardize” option in Plink2. We defined statistical significance as α = .025 (Bonferroni correction for 2 tested variants [0.05/2]). We performed power calculations using Quanto.26 

Sensitivity analysis

To mirror imputation inaccuracies, we conducted a sensitivity analysis in which we introduced noise in the genotype of Leucegene patients. We first created a vector of 10 000 genotypes for rs3916765 following the probabilities from the Hardy-Weinberg equilibrium. We randomly selected 3% of the Leucegene cases with CN-AML and randomly assigned them a rs3916765 genotype from the simulated vector. Then, the association between rs3916765 and CN-AML was tested with CaG participants and Leucegene patients (same model as earlier). In 1000 permutations, the association between rs3916765 and CN-AML always remained significant at α = .025.

eQTL analysis

The transcriptomic data from Leucegene have been described previously27 and are available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE232130. Reads were mapped to the genome using Star (version 2.7.1) 28 and quantified using RSEM (RNA-seq by Expectation-Maximization) (version 1.3.2).29 RSEM generated pseudocount that were used for differential gene expression analysis. Genes whose pseudocount sums was <10 over the whole cohort were removed. Gene counts were log-normalized (log10[count +1]) and the association between gene expression counts and genotype at rs3916765 was tested by linear regression in Plink2 (the effect allele is the risk allele G). We used age, sex, and 10 PCs as covariates; quantitative covariates were rescaled with the covar-variance-standardize function. The P values were adjusted with the Benjamini-Hochberg (false discovery rate) procedure in R (p.adjust function).30 The rs3916765 genotype–specific expression in whole blood from GTEx (Genotype-Tissue Expression) (GTEx Analysis Release version 8) was obtained from the GTEx Portal on 20 September 2024.31 

PRSs

PRSs were calculated with Plink2 on the merged data set with strict filters. For the 11 blood-cell traits, we used the variants from Chen et al32 and their European-ancestry variant weights, including 150 variants for basophil count, 346 for eosinophil count, 356 for hematocrit, 454 for mean corpuscular volume, 394 for monocyte count, 392 for mean platelet volume, 352 for neutrophil count, 553 for platelet count, 448 for red blood cell count, 340 for red cell distribution width, and 443 for white blood cell count; of which 112, 279, 321, 387, 340, 351, 288, 457, 384, 297, and 365, respectively, were present in the merged data set with strict filters. For the acute lymphocytic leukemia (ALL) and chronic lymphocytic leukemia (CLL) PRS, we used variants from Berndt et al,33 including 15 variants for ALL and 43 for CLL, of which 12 and 32 were present in our merged data set, respectively. All variants included in the PRS were lifted over from build hg19 to hg38 with the online LiftOver tool from the UCSC (University of California, Santa Cruz) Genome Browser.34 We applied an inverse normal transformation to the PRS and did a logistic regression to test their association with AML, correcting for age, sex, and the first 10 PCs. We defined statistical significance as α = .0038 (Bonferroni correction for 13 tested PRSs [0.05/13]).

Ethics

The Montreal Heart Institute Ethics Committee gave ethical approval for this work (project no. 2017-2247).

Integration of Leucegene AML cases and CaG controls

The demographics of the Leucegene and CaG cohorts are in supplemental Table 1. To call genotypes in the Leucegene cohort, we used Glimpse and phased haplotypes from the 1000 Genomes Project (see “Methods”).18,19 We applied stringent filters to remove poorly imputed variants and samples with low-quality sequence data to obtain a data set that totals 16 600 819 variants and 598 samples (supplemental Figure 1 and “Methods”). To validate the Glimpse variant calls, we calculated the genotype concordance rate with WES variant calls available from 336 Leucegene participants: across the 4 WES batches, concordance was high (>97%; see “Methods”), suggesting that we can accurately call genotypes from low-pass WGS data.

To avoid false-positive associations due to technical differences between the Leucegene and CaG data sets, we applied additional filters before merging the genotype data. We restricted downstream analyses to individuals of European ancestry (567 AML cases and 1865 controls) and to 7 658 313 common variants (MAF of ≥1% in the combined data set) with consistent allele frequencies in the gnomAD NFE population (supplemental Figure 2; “Methods”).35 After merging the Leucegene and CaG data sets, we applied data reduction methods (PCA and UMAP) to visualize potential population stratification.36 First, we confirmed that the Leucegene and CaG participants cluster with European-ancestry individuals from the 1000 Genomes Project (Figure 1A-B). Second, we repeated these analyses with only the Leucegene and CaG participants: both cohorts are well integrated together, except for a cluster of 55 individuals (54 AML cases and 1 CaG control) that cluster separately from the main population (Figure 1C-D). Additional analyses using the different subpopulations from the 1000 Genomes Project’s European superpopulation showed that these 55 individuals clustered with the Tuscan samples, suggesting that they might be Quebec residents of Italian descent (supplemental Figure 3).

Figure 1.

Genetic ancestry in the AML Leucegene and CaG cohorts. PCA (A,C) and UMAP representations (B,D) of the Leucegene and CaG participants, with (A-B) or without (C-D) individuals from the 1000 Genomes Project. Superpopulation refers to the 1000 Genomes Project labels (AFR, AM, EAS, EUR, and SAS). CREGION refers to the region of recruitment for the CaG participants (Leu for patients with AML). In red are outlier Leucegene and CaG samples that cluster with the Italian samples (TSI) from the 1000 Genomes Project (see also supplemental Figure 3). AFR, African; AM, American; EAS, East Asian; EUR, European; Leu, Leucegene; SAS, South Asian.

Figure 1.

Genetic ancestry in the AML Leucegene and CaG cohorts. PCA (A,C) and UMAP representations (B,D) of the Leucegene and CaG participants, with (A-B) or without (C-D) individuals from the 1000 Genomes Project. Superpopulation refers to the 1000 Genomes Project labels (AFR, AM, EAS, EUR, and SAS). CREGION refers to the region of recruitment for the CaG participants (Leu for patients with AML). In red are outlier Leucegene and CaG samples that cluster with the Italian samples (TSI) from the 1000 Genomes Project (see also supplemental Figure 3). AFR, African; AM, American; EAS, East Asian; EUR, European; Leu, Leucegene; SAS, South Asian.

Close modal

Replication of variants associated with AML risk

Because statistical power to identify genome-wide significant associations with AML risk is limited given our sample size (supplemental Figure 4), we focused our effort on the replication of previous genome-wide significant AML variants. Lin et al6 reported 2 genome-wide significant susceptibility loci for AML in a meta-analysis that totaled 4018 AML cases and 10 488 controls: rs4930561 in KMT5B is associated with all AML types, and rs3916765 near HLA-DQA2 is associated with CN-AML. In our Leucegene-CaG data set, we replicated the association between HLA-DQA2-rs3916765 and CN-AML (G-allele odds ratio [OR], 1.96; P = .0031; Table 1). This association between rs3916765 and CN-AML remained significant after excluding the aforementioned 55 outlier samples (P = .006) or in a sensitivity analysis in which we introduced noise in the genotype to mirror imputation inaccuracies (see “Methods”; supplemental Figure 5). The AML variant in KMT5B did not replicate in our study at our predefined statistical threshold (α = .025), but it is nominally associated with CN-AML (P value = .039; Table 1). Two more variants have been reported to be associated with AML at GW significance: rs75797233 near BICRA, and rs12203592 in an intron of IRF4.7,8 rs75797233 was not imputed by Glimpse, maybe because of its low MAF (0.02).8 rs12203592 was also not associated with AML risk (P = .50), but we note that this variant was initially excluded from association testing because its Glimpse imputation quality was just below the quality threshold used in our study (imputation R2 = 0.898). Additional studies have reported suggestive genetic associations with AML risk (ie, that did not reach GW significance).7,8,37,38 Although not the primary focus of our study, we queried these variants in our case-control data set but did not find associations that reached nominal significance (all P > .05; supplemental Table 2).

Table 1.

Replication results in the Leucegene-CaG data set for 2 variants previously associated with AML by GWAS

VariantRisk alleleGeneAll AMLN = 567 cases, 1865 controlsCN-AMLn = 240 cases, 1865 controlsAF of the risk allele
P valueOR (95% CI)P valueOR (95% CI)gnomAD NFEControlCases all AMLCases CN-AML
6:32717773 G>A rs3916765 HLA-DQA2 .13 1.22 (0.94-1.58) .003 1.96 (1.26-3.06) 0.89 0.90 0.92 0.95 
11:68164294 G>A rs4930561 KMT5B .10 1.13 (0.98-1.30) .039 1.24 (1.01-1.52) 0.50 0.49 0.52 0.54 
VariantRisk alleleGeneAll AMLN = 567 cases, 1865 controlsCN-AMLn = 240 cases, 1865 controlsAF of the risk allele
P valueOR (95% CI)P valueOR (95% CI)gnomAD NFEControlCases all AMLCases CN-AML
6:32717773 G>A rs3916765 HLA-DQA2 .13 1.22 (0.94-1.58) .003 1.96 (1.26-3.06) 0.89 0.90 0.92 0.95 
11:68164294 G>A rs4930561 KMT5B .10 1.13 (0.98-1.30) .039 1.24 (1.01-1.52) 0.50 0.49 0.52 0.54 

CI, confidence interval.

AML are classified into molecular subgroups defined by genetic abnormalities.4 AML with mutations at the C terminus of the protein encoded by the nucleophosmin gene NPM1 on chromosome 5 represents a third of AML cases, and 60% of CN-AML cases (see “Methods”).4,39 In the Leucegene-CaG data set, when we stratified the analyses to AML cases with NPM1 mutations (NNPM1 = 196 cases) or with NPM1 mutations and a normal karyotype (NCN-NPM1 = 157 cases), the association with HLA-DQA2-rs3916765 was stronger (with ORs of 2.27 and 2.87, respectively, for the G-allele; supplemental Table 3). The association at KMT5B was also stronger with NPM1-mutated AML and CN-NPM1-mutated AML (supplemental Table 3).

eQTL analysis at rs3916765

We queried the GTEx whole-blood eQTL data set and found that genotypes at rs3916765 are associated with the expression of nearby HLA-DOB and HLA-DQA2 (Figure 2A). To replicate this result, we accessed the previously described whole-blood Leucegene RNA-sequencing data set (N = 567) 22 and tested the association between rs3916765 and gene expression. After correcting for multiple testing, we validated the association between the AML risk G-allele and lower HLA-DOB expression levels (adjusted P value = 5.12 × 10−10; Figure 2B-C; supplemental Table 4). We also detected the association with HLA-DQA2 in Leucegene, although it was only nominally significant (adjusted P = .34, nominal P = 2.2 × 10−4). Three other genes reached statistical significance, including ENSG00000284430 (adjusted P = 1.07 × 10−4), TRBV7-3 (adjusted P = .032), and GOLGA2P6 (adjusted P = .035; supplemental Table 4). ENSG00000284430 (chromosome 19) encodes a long noncoding RNA of unknown function, TRBV7-3 (chromosome 7) encodes a probable nonfunctional open reading frame of the variable domain of T-cell receptor β chain, and GOLGA2P6 (chromosome 10) encodes a Golgin pseudogene. Whether these genes play a role in AML will require further work. The association between genotypes at rs3916765 and HLA-DOB expression levels was significant in Leucegene patients without (n = 371, adjusted P = 1.48 × 10−5) or with (n = 196, adjusted P = .016) NPM1 mutations (supplemental Table 4).

Figure 2.

rs3916765 is an eQTL for HLA class 2 genes in whole-blood. (A) rs3916765 genotype–specific expression of HLA-DQA2 and HLA-DOB in whole-blood from GTEx. (B) Associations between genotypes at rs3916765 and gene expressed in whole-blood in Leucegene. The effect size (x-axis) is given for the AML risk G-allele. P values (y-axis) were adjusted with the Benjamini-Hochberg (false discovery rate) procedure. The horizontal line corresponds to an adjusted P = .05. Red points correspond to HLA class 2 genes. (C) Genotype-specific whole-blood expression of HLA class 2 genes stratified by genotypes at rs3916765 in Leucegene participants. ∗nominal P < .05; ∗∗nominal P < .01; ∗∗∗ nominal P < .001. P_adj, adjusted P value.

Figure 2.

rs3916765 is an eQTL for HLA class 2 genes in whole-blood. (A) rs3916765 genotype–specific expression of HLA-DQA2 and HLA-DOB in whole-blood from GTEx. (B) Associations between genotypes at rs3916765 and gene expressed in whole-blood in Leucegene. The effect size (x-axis) is given for the AML risk G-allele. P values (y-axis) were adjusted with the Benjamini-Hochberg (false discovery rate) procedure. The horizontal line corresponds to an adjusted P = .05. Red points correspond to HLA class 2 genes. (C) Genotype-specific whole-blood expression of HLA class 2 genes stratified by genotypes at rs3916765 in Leucegene participants. ∗nominal P < .05; ∗∗nominal P < .01; ∗∗∗ nominal P < .001. P_adj, adjusted P value.

Close modal

PRSs for leukemia and blood cell traits

There are currently no published PRSs for AML. As exploratory analyses, we calculated PRSs in the Leucegene-CaG data set for acute ALL and CLL. Because AML is a malignancy of the myeloid lineage, we also calculated PRSs for 11 blood-cell traits of the myeloid lineage: basophil count, eosinophil count, hematocrit, mean corpuscular volume, monocyte count, mean platelet volume, neutrophil count, platelet count, red blood cell count, red cell width distribution, and total white blood cell count. When we tested the association between AML risk and the different PRSs, we observed no significant associations after correcting for the number of tests performed (Figure 3; supplemental Figure 6). The most significant association was between AML and the neutrophil PRSs: the risk of AML increased by 11.8% for each standard deviation increased in the neutrophil PRS (nominal P = .02). Except for 2, the variants used in the calculation of the PRS are not significantly associated with AML or CN-AML when taken individually (supplemental Figure 7).

Figure 3.

Association of leukemia and blood-cell traits PRS with AML risk. We calculated PRSs for ALL, CLL, and 7 blood-cell traits in the Leucegene AML cases and CaG controls. The black dots correspond to the ORs per 1 standard deviation in the PRSs. The whiskers are the 95% CIs for the ORs. BASO, basophil count; CI, confidence interval; EOS, eosinophil count; HCT, hematocrit; MCV, mean corpuscular volume; MONO, monocyte count; MPV, mean platelet volume; NEU, neutrophil count; RBC, red blood cell count, RDW, red cell distribution width; PLT, platelet count; WBC, white blood cell count.

Figure 3.

Association of leukemia and blood-cell traits PRS with AML risk. We calculated PRSs for ALL, CLL, and 7 blood-cell traits in the Leucegene AML cases and CaG controls. The black dots correspond to the ORs per 1 standard deviation in the PRSs. The whiskers are the 95% CIs for the ORs. BASO, basophil count; CI, confidence interval; EOS, eosinophil count; HCT, hematocrit; MCV, mean corpuscular volume; MONO, monocyte count; MPV, mean platelet volume; NEU, neutrophil count; RBC, red blood cell count, RDW, red cell distribution width; PLT, platelet count; WBC, white blood cell count.

Close modal

AML is 1 of the most common forms of leukemia affecting adults. Its genetic etiology is complex, involving well-established monogenic alterations as well as modest effect and common germ line variants.1,4,5 GWAS have started to describe the polygenic architecture of AML,1,6-8 but efforts have remained limited because of the low prevalence of the disease and heterogeneity between AML subgroups. To strengthen previous AML GWAS findings, we performed a replication study using data from the well-characterized Leucegene AML cohort and controls from the population-matched CaG cohort.

Despite our limited sample size, we could replicate the association between CN-AML and genotypes at HLA-DQA2-rs3916765. The effect size of this association was strong for a common variant (OR = 1.96), and even stronger when restricting the analysis to patients with an NPM1 mutation (OR = 2.87; Table 1; supplemental Table 3). The risk G-allele at rs3916765 has a frequency of 0.90 in CaG controls and of 0.92 in patients with AML but it reaches 0.95 in patients with CN-AML and 0.96 in those with NPM1-mutated AML or CN-NPM1-mutated AML. Mutant NPM1 proteins can contain peptides that bind HLA class 1 and 2 molecules more efficiently, making AML cells more vulnerable to immune recognition.40 Genotypes at rs3916765 might affect gene expression program of antigen presenting cells, thereby affecting their role in the immune response by providing a protective effect. This could explain that DNA sequence variants near HLA-DQA2 have stronger effect sizes in NPM1-mutated AML or CN-NPM1-mutated AML. However, our eQTL analyses indicate that genotypes at rs3196765 associate with HLA-DOB and HLA-DQA2 expression independently of NPM1 mutations, suggesting that the gene regulatory effect is not dependent on NPM1 status. Given the complexity of the HLA locus in the human genome, more functional work is needed to determine how genotypes at rs3916765 affects AML risk in combination with the NPM1 mutation status.

To perform our analyses, we faced 2 important challenges. First, we called genotypes from low-pass WGS in the Leucegene cohort. Although there are limitations to this approach (eg, the imputation quality of rare variants), Glimpse generated high-quality genotype calls when compared with genotypes obtained using high-coverage WES. Second, because our AML cases and controls were profiled on different platforms and at different times, any technical artifacts (or batch effects) may lead to spurious associations. To minimize this risk, we applied very stringent QC thresholds on the data sets. We explicitly decided not to perform GWA testing but to focus on the replication of known AML variants. We also reasoned that PRSs would be less sensitive to this issue, as a false-positive result would imply systematic biases across hundreds of variants.

In conclusion, the HLA locus is a validated and relatively strong risk factor for AML. Whether it affects response to treatment is currently unknown. More generally, our study provides bioinformatic guidance for future genetic efforts that require combining different data sets generated using different methods. We also validated CaG as an external cohort of controls to enable association studies in the Quebec population.

The authors thank Muriel Draoui for Leucegene project coordination and the personnel at the Institute for Research in Immunology and Cancer (IRIC) genomics platform for sequencing. The authors acknowledge the invaluable contribution of Quebec Leukemia Cell Bank members and of IRIC bioinformatic platform member Patrick Gendron. The authors thank all CaG participants; and Daniel Taliun for helpful advice on the best strategy to call variants from low-pass whole-genome sequence data.

Leucegene sequencing was supported by the Government of Canada through Genome Canada and the Ministère de l’Économie, de l’Innovation et des Exportations du Québec through Génome Québec. Leukemia samples and cytogenetic and clinical data of this cohort were provided by the Quebec Leukemia Cell Bank supported by grants from the Cancer Research Network of the Fonds de Recherche du Québec - Santé. V.-P.L. is supported by the Fondation Charles-Bruneau and by the Cole Foundation and holds a Fonds de Recherche en Santé du Québec clinician scientist award. The sequencing of the CaG data set was covered by grants from Genome Canada, Génome Québec, and Centre Hospitalier Universitaire Ste-Justine (Projet GénoRef-Q). G.L. is supported by a Canada Research Chair (tier 1) and holds funds from the Canadian Institutes of Health Research (projects no. 426541 and no. 486808) and the Montreal Heart Institute Foundation. This research was enabled, in part, by support provided by Calcul Quebec (https://www.calculquebec.ca/en/) and the Digital Research Alliance of Canada (www.computecanada.ca).

Contribution: R.L., V.-P.L., and G.L. conceived and designed the analyses; J.H., G.S., S.L., V.-P.L., and G.L. collected the data; J.H., G.S., S.L., V.-P.L., and G.L. contributed data; R.L. and V.L. performed analyses; J.H., G.S., V.-P.L., and G.L. secured funding and supervised the work; and R.L., V.P.-L., and G.L. wrote the manuscript, with contributions from all authors.

Conflict-of-interest disclosure: The authors declare no competing financial interests.

Correspondence: Guillaume Lettre, Montreal Heart Institute/University of Montreal, 5000 Belanger St, Montreal, QC H1T 1C8, Canada; email: guillaume.lettre@umontreal.ca.

1.
Marrero
RJ
,
Lamba
JK
.
Current landscape of genome-wide association studies in acute myeloid leukemia: a review
.
Cancers (Basel)
.
2023
;
15
(
14
):
3583
.
2.
Canadian Cancer Society
.
Acute myeloid leukemia statistics. Updated February 2024
. Accessed 1 August 2024. https://cancer.ca/en/cancer-information/cancer-types/acute-myeloid-leukemia-aml/statistics.
3.
National Cancer Institute
.
SEER Cancer Stat Facts: Acute Myeloid Leukemia
. Updated April 2025. Accessed 1 August 2024. https://seer.cancer.gov/statfacts/html/amyl.html.
4.
Khoury
JD
,
Solary
E
,
Abla
O
, et al
.
The 5th edition of the World Health Organization Classification of haematolymphoid tumours: myeloid and histiocytic/dendritic neoplasms
.
Leukemia
.
2022
;
36
(
7
):
1703
-
1719
.
5.
Wartiovaara-Kautto
U
,
Hirvonen
EAM
,
Pitkänen
E
, et al
.
Germline alterations in a consecutive series of acute myeloid leukemia
.
Leukemia
.
2018
;
32
(
10
):
2282
-
2285
.
6.
Lin
WY
,
Fordham
SE
,
Hungate
E
, et al
.
Genome-wide association study identifies susceptibility loci for acute myeloid leukemia [published correction appears in Nat Commun. 2022;13(1):2]
.
Nat Commun
.
2021
;
12
(
1
):
6233
.
7.
Walker
CJ
,
Oakes
CC
,
Genutis
LK
, et al
.
Genome-wide association study identifies an acute myeloid leukemia susceptibility locus near BICRA
.
Leukemia
.
2019
;
33
(
3
):
771
-
775
.
8.
Wang
J
,
Clay-Gilmour
AI
,
Karaesmen
E
, et al
.
Genome-wide association analyses identify variants in IRF4 associated with acute myeloid leukemia and myelodysplastic syndrome susceptibility
.
Front Genet
.
2021
;
12
:
554948
.
9.
Artomov
M
,
Loboda
AA
,
Artyomov
MN
,
Daly
MJ
.
Public platform with 39,472 exome control samples enables association studies without genotype sharing
.
Nat Genet
.
2024
;
56
(
2
):
327
-
335
.
10.
Hendricks
AE
,
Billups
SC
,
Pike
HNC
, et al
.
ProxECAT: Proxy External Controls Association Test. A new case-control gene region association test using allele frequencies from public controls
.
PLoS Genet
.
2018
;
14
(
10
):
e1007591
.
11.
Lee
S
,
Kim
S
,
Fuchsberger
C
.
Improving power for rare-variant tests by integrating external controls
.
Genet Epidemiol
.
2017
;
41
(
7
):
610
-
619
.
12.
Moison
C
,
Lavallée
VP
,
Thiollier
C
, et al
.
Complex karyotype AML displays G2/M signature and hypersensitivity to PLK1 inhibition
.
Blood Adv
.
2019
;
3
(
4
):
552
-
563
.
13.
Roy-Gagnon
MH
,
Moreau
C
,
Bherer
C
, et al
.
Genomic and genealogical investigation of the French Canadian founder population structure
.
Hum Genet
.
2011
;
129
(
5
):
521
-
531
.
14.
Awadalla
P
,
Boileau
C
,
Payette
Y
, et al;
CARTaGENE Project
.
Cohort profile of the CARTaGENE study: Quebec’s population-based biobank for public health and personalized genomics
.
Int J Epidemiol
.
2013
;
42
(
5
):
1285
-
1299
.
15.
Spinella
JF
,
Chagraoui
J
,
Moison
C
, et al
.
DELE1 haploinsufficiency causes resistance to mitochondrial stress-induced apoptosis in monosomy 5/del(5q) AML
.
Leukemia
.
2024
;
38
(
3
):
530
-
537
.
16.
Danecek
P
,
Bonfield
JK
,
Liddle
J
, et al
.
Twelve years of SAMtools and BCFtools
.
Gigascience
.
2021
;
10
(
2
):
giab008
.
17.
DePristo
MA
,
Banks
E
,
Poplin
R
, et al
.
A framework for variation discovery and genotyping using next-generation DNA sequencing data
.
Nat Genet
.
2011
;
43
(
5
):
491
-
498
.
18.
Rubinacci
S
,
Ribeiro
DM
,
Hofmeister
RJ
,
Delaneau
O
.
Efficient phasing and imputation of low-coverage sequencing data using large reference panels [published correction appears in Nat Genet. 2021;53(3):412]
.
Nat Genet
.
2021
;
53
(
1
):
120
-
126
.
19.
1000 Genomes Project Consortium
,
Abecasis
GR
,
Auton
A
,
Durbin
RM
, et al
.
A global reference for human genetic variation
.
Nature
.
2015
;
526
(
7571
):
68
-
74
.
20.
Li
H
.
Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM
.
2013
. arXiv:1303.3997v2 [q-bio.GN]..
21.
Purcell
S
,
Chang
C
.
PLINK v2.00a3.6 AVX2
. Published August 2022. Accessed 1 September 2023. www.cog-genomics.org/plink/2.0/.
22.
Chang
CC
,
Chow
CC
,
Tellier
LC
,
Vattikuti
S
,
Purcell
SM
,
Lee
JJ
.
Second-generation PLINK: rising to the challenge of larger and richer datasets
.
Gigascience
.
2015
;
4
(
1
):
7
.
23.
Fang
H
,
Hui
Q
,
Lynch
J
, et al;
VA Million Veteran Program
.
Harmonizing genetic ancestry and self-identified race/ethnicity in genome-wide association studies
.
Am J Hum Genet
.
2019
;
105
(
4
):
763
-
772
.
24.
Purcell
S
,
Chang
C
.
PLINK v1.90b6.21
. Published October 2020. Accessed 1 September 2023. www.cog-genomics.org/plink/1.9/.
25.
Lisi
V
,
Blanchard
È
,
Vladovsky
M
, et al
.
Unified gene expression signature of novel NPM1 exon 5 mutations in acute myeloid leukemia [published correction appears in Blood Adv. 2023;7(18):5314]
.
Blood Adv
.
2022
;
6
(
17
):
5160
-
5164
.
26.
Gauderman
WJ
.
Sample size requirements for matched case-control studies of gene-environment interaction
.
Stat Med
.
2002
;
21
(
1
):
35
-
50
.
27.
Lavallée
VP
,
Baccelli
I
,
Krosl
J
, et al
.
The transcriptomic landscape and directed chemical interrogation of MLL-rearranged acute myeloid leukemias
.
Nat Genet
.
2015
;
47
(
9
):
1030
-
1037
.
28.
Dobin
A
,
Davis
CA
,
Schlesinger
F
, et al
.
STAR: ultrafast universal RNA-seq aligner
.
Bioinformatics
.
2013
;
29
(
1
):
15
-
21
.
29.
Li
B
,
Dewey
CN
.
RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome
.
BMC Bioinformatics
.
2011
;
12
(
1
):
323
.
30.
R Core Team
. R: A language and environment for statistical computing.
R Foundation for Statistical Computing
;
2020
https://www.R-project.org/. Accessed 1 September 2023.
31.
GTEx Consortium
,
Deluca
DS
,
Segrè
AV
, et al
.
The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans
.
Science
.
2015
;
348
(
6235
):
648
-
660
.
32.
Chen
MH
,
Raffield
LM
,
Mousas
A
, et al
.
Trans-ethnic and ancestry-specific blood-cell genetics in 746,667 individuals from 5 global populations
.
Cell
.
2020
;
182
(
5
):
1198
-
1213.e14
.
33.
Berndt
SI
,
Vijai
J
,
Benavente
Y
, et al
.
Distinct germline genetic susceptibility profiles identified for common non-Hodgkin lymphoma subtypes [published correction appears in Leukemia. 2023;37(10):2142]
.
Leukemia
.
2022
;
36
(
12
):
2835
-
2844
.
34.
Hinrichs
AS
,
Karolchik
D
,
Baertsch
R
, et al
.
The UCSC Genome Browser Database: update 2006
.
Nucleic Acids Res
.
2006
;
34
(
Database issue
):
D590
-
D598
.
35.
Chen
S
,
Francioli
LC
,
Goodrich
JK
, et al
.
A genomic mutational constraint map using variation in 76,156 human genomes [published correction appears in Nature. 2024;626(7997):E1]
.
Nature
.
2024
;
625
(
7993
):
92
-
100
.
36.
Price
AL
,
Patterson
NJ
,
Plenge
RM
,
Weinblatt
ME
,
Shadick
NA
,
Reich
D
.
Principal components analysis corrects for stratification in genome-wide association studies
.
Nat Genet
.
2006
;
38
(
8
):
904
-
909
.
37.
Lv
H
,
Zhang
M
,
Shang
Z
, et al
.
Genome-wide haplotype association study identify the FGFR2 gene as a risk gene for acute myeloid leukemia
.
Oncotarget
.
2017
;
8
(
5
):
7891
-
7899
.
38.
Cao
S
,
Yang
G
,
Zhang
J
, et al
.
Replication analysis confirms the association of several variants with acute myeloid leukemia in Chinese population
.
J Cancer Res Clin Oncol
.
2016
;
142
(
1
):
149
-
155
.
39.
Patel
SS
.
NPM1-mutated acute myeloid leukemia: recent developments and open questions
.
Pathobiology
.
2024
;
91
(
1
):
18
-
29
.
40.
Liso
A
,
Colau
D
,
Benmaamar
R
, et al
.
Nucleophosmin leukaemic mutants contain C-terminus peptides that bind HLA class I molecules
.
Leukemia
.
2008
;
22
(
2
):
424
-
426
.

Author notes

Accessing the CARTaGENE whole-genome sequencing data requires the submission of a request application; the procedure is explained at: https://cartagene.qc.ca/en/researchers/access-request.html. The code used to analyze the data and generate the figures is available upon request from the corresponding author, Guillaume Lettre (guillaume.lettre@umontreal.ca).

The full-text version of this article contains a data supplement.

Supplemental data