Abstract
The variable pattern of complications seen among patients with sickle cell disease is unlikely to result solely from the HBB glu6val mutation. Conventional case-control analyses using single nucleotide polymorphisms (SNPs) are useful for detecting single-gene associations; however, studying gene interactions becomes increasingly inefficient as the number of SNPs increases. More complex statistical methods such as classification and regression trees (CART), stochastic gradient boosting (SGB) and Bayesian networks (Sebastiani et al, Nature Genet 37: 435, 2005) allow many SNPs and covariates to be analyzed simultaneously. CART is a recursive partitioning method that grows a single tree-based model by creating “if-then” splitting rules that stratify patients into risk groups. The resulting over-fitted tree is then pruned to optimize its size and classification accuracy, and its predictive ability is assessed using a test set or cross-validation. SGB is a similar method; however, instead of one large tree, many small trees are grown sequentially. Each new tree improves the model built in the prior stage and is weighted according to its predictive ability. The final predicted value is computed by adding the weighted contributions of the small trees, and a classification is assigned based on that value. From the more than 4,000 patients in the Cooperative Study of Sickle Cell Disease (CSSCD), a subset of 1,353 were genotyped for 353 SNPs in over 160 genes that might affect disease pathophysiology. Clinical data for these patients were merged with the genotype data, and 490 patients were identified who had at least one of the following vasoocclusive events: stroke, osteonecrosis of the humeral or femoral head, and/or priapism. With the remaining patients serving as controls, CART and SGB were run on a random sample of 80% of the patients to identify genes and covariates whose interactions characterize patients with these vasoocclusive events, and accuracy was assessed using the remaining patients as a test set. Along with age, sex and HbF, CART identified TGFBR3, SMAD9, CISH, KL and MAP3K7IP1 as being associated with the vasoocclusive event phenotype and classified patients with a sensitivity of 34% and a specificity of 82%. SGB, while identifying the same pattern of genes and covariates as being associated with vasoocclusive events, classified patients with a sensitivity of 80% and a specificity of 68%.
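The sequential, additive structure of SGB described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes genotypes coded as 0/1/2 minor-allele counts, uses one-split trees ("stumps"), fits each tree to least-squares residuals on a random half of the patients, and replaces the per-tree weighting with a single fixed learning rate. The SNP names and the toy data are hypothetical.

```python
import random

def fit_stump(X, residuals, rows):
    """Find the one if-then split (feature, threshold) that best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({X[i][j] for i in rows}):
            left = [residuals[i] for i in rows if X[i][j] <= t]
            right = [residuals[i] for i in rows if X[i][j] > t]
            if not left or not right:
                continue  # a split must leave patients on both sides
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lmean, rmean)
    if best is None:  # no valid split in this subsample: predict a constant
        mean = sum(residuals[i] for i in rows) / len(rows)
        return (0, float("inf"), mean, mean)
    return best[1:]

def stump_predict(stump, x):
    j, t, left_value, right_value = stump
    return left_value if x[j] <= t else right_value

def fit_sgb(X, y, n_trees=50, learning_rate=0.1, subsample=0.5, seed=0):
    """Grow many small trees sequentially, each on a random subsample."""
    rng = random.Random(seed)
    n = len(y)
    base = sum(y) / n                 # start from the overall event rate
    scores = [base] * n
    stumps = []
    for _ in range(n_trees):
        residuals = [y[i] - scores[i] for i in range(n)]
        rows = rng.sample(range(n), max(2, int(subsample * n)))  # the "stochastic" step
        stump = fit_stump(X, residuals, rows)
        stumps.append(stump)
        for i in range(n):            # each new tree corrects the prior stage
            scores[i] += learning_rate * stump_predict(stump, X[i])
    return base, learning_rate, stumps

def predict_sgb(model, x, threshold=0.5):
    """Add the weighted contribution of every small tree, then classify."""
    base, learning_rate, stumps = model
    score = base + sum(learning_rate * stump_predict(s, x) for s in stumps)
    return 1 if score >= threshold else 0

# Hypothetical toy data: columns are [snp_a, snp_b]; only snp_a carries risk.
X = [[a, b] for a in (0, 1, 2) for b in (0, 1, 2)] * 4
y = [1 if a >= 1 else 0 for a, b in X]
model = fit_sgb(X, y)
print(predict_sgb(model, [0, 2]), predict_sgb(model, [2, 0]))
```

A production analysis would normally rely on an established boosting library, deeper trees that can capture SNP-by-SNP interactions, and a loss function suited to binary outcomes; the sketch only shows how the small trees accumulate into one score.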
While CART provides a simple initial screen of genes and covariates that may be simultaneously associated with the phenotype, SGB provides a more sensitive method of classification (80% versus 34%) at a somewhat lower specificity (68% versus 82%). Neither method requires the extensive model building of Bayesian networks. A more thorough understanding of the molecular mechanisms should make it possible to predict subphenotypes and improve their management.
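The hold-out evaluation behind these figures can be sketched as follows. This is an illustrative outline, not the authors' code: the helper names are hypothetical, the split assumes the 80% fraction is rounded down, and the true and predicted labels are made-up values chosen only to show the arithmetic (1 = vasoocclusive event, 0 = control).

```python
import random

def split_indices(n, train_fraction=0.8, seed=0):
    """Randomly assign patient indices to a training set and a test set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(train_fraction * n)  # assumption: round down
    return indices[:n_train], indices[n_train:]

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# 1,353 genotyped patients split 80/20 (the exact counts used are not stated).
train_idx, test_idx = split_indices(1353)
print(len(train_idx), len(test_idx))

# Hypothetical test-set labels: 5 cases and 5 controls.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```

Because sensitivity counts only patients with events and specificity counts only controls, a classifier can gain on one while losing on the other, which is the trade-off seen between CART and SGB here.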
2005, The American Society of Hematology