Figure 3.
Classification model for prediction of BMF etiology in cluster A. (A) Top predictors ranked by importance using the ReliefF method. Feature selection ranked 27 variables by importance, and the top 25 were considered important predictors for the model. (B) Correlation coefficient (R) between the prediction target (categorical) and each continuous variable. R was calculated and plotted in order of variable importance. (C) Heatmap showing correlations among continuous variables. (D) Confusion matrix with prediction results for the validation cohort. The model was validated in the USP data set. Cases labeled or predicted as acquired are represented by “A,” whereas cases labeled or predicted as inherited are represented by “I.” Model sensitivity represents the ability to correctly predict acquired cases, whereas model specificity represents the ability to correctly predict inherited cases. (E) Cases from cluster A of the USP data set that were misclassified by the model. Cases labeled as acquired or inherited that were correctly predicted by the model are represented by purple circles. Cases labeled as acquired that were predicted as inherited, or labeled as inherited and predicted as acquired, are represented by pink triangles. (F) Prediction results for VUS cases. Results are shown according to clinical diagnosis and the mutated genes observed in VUS cases. Germ line VUSs were mostly found in TERT (n = 10), SAMD9 or SAMD9L (n = 10), RTEL1 (n = 8), SBF2 (n = 6), and GATA2 (n = 3). Cases predicted as inherited or acquired by the model are represented by red and blue circles, respectively. Of note, SAMD9/L variants are often VUSs because in silico tools do not predict the pathogenicity of gain-of-function variants and many cases are de novo without previous family history. ALC, absolute lymphocyte count; ANC, absolute neutrophil count; BM, bone marrow; Hb, hemoglobin level (g/dL); MCV, mean corpuscular volume.
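
As an illustration of the convention used in panel D, the following minimal sketch computes sensitivity and specificity with acquired ("A") treated as the positive class and inherited ("I") as the negative class. The function name `sensitivity_specificity` and the toy labels are hypothetical and are not taken from the study; they only show how the two quantities follow from the confusion matrix counts.

```python
from collections import Counter


def sensitivity_specificity(labels, predictions, positive="A", negative="I"):
    """Return (sensitivity, specificity) with `positive` as the positive class.

    Sensitivity = correctly predicted acquired / all labeled acquired.
    Specificity = correctly predicted inherited / all labeled inherited.
    """
    # Count each (true label, predicted label) pair, i.e. the confusion matrix cells.
    counts = Counter(zip(labels, predictions))
    tp = counts[(positive, positive)]  # acquired predicted as acquired
    fn = counts[(positive, negative)]  # acquired predicted as inherited
    tn = counts[(negative, negative)]  # inherited predicted as inherited
    fp = counts[(negative, positive)]  # inherited predicted as acquired
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up toy data for illustration only (not the USP validation cohort).
labels = ["A", "A", "A", "A", "I", "I", "I", "I"]
predictions = ["A", "A", "A", "I", "I", "I", "A", "A"]
sens, spec = sensitivity_specificity(labels, predictions)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Under this convention, the misclassified cases in panel E correspond to the off-diagonal cells of the matrix: acquired cases predicted as inherited, and inherited cases predicted as acquired.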