Identification of prognostic pretreatment cell types by ML. (A) Bootstrapped logistic regression pipeline to predict imatinib treatment outcome based on CML prognostic group. Briefly, given the transcriptome of a single cell, the pipeline identifies GE patterns that predict the group to which that cell belongs. Specifically, the pipeline builds distinct GE-based signatures that classify the cells of each cell type into a prognostic group. (B) Top table: accuracy scores (ACC) of (i) multiclass (A vs B vs C) and (ii) binary (C vs not C [AB]) logistic-regression classifiers. The ACC scores for the top 3 models of both the multiclass and binary classifiers are highlighted in red. ACC is the ratio of correct predictions (true positives plus true negatives) to all observations (positive and negative). Bottom table: confusion matrices displaying cell counts for the top 3 cell types in the multiclass classifiers (left) and for the top 3 binary classifiers (right). In a random model, ACC = 0.33 for a multiclass (3-class) classifier and 0.5 for a binary classifier. (C) A leave-one-patient-out classifier to identify marker genes in pseudobulked transcriptomes that can serve as prognostic markers. Only markers from the classifier in panel A are considered as candidate features in the model. Confusion matrices for the patient-specific (D) multiclass HSC classifier and (E) binary HSC and NK cell classifiers. Precision (Pre) is the ratio of true positives to all predicted positives (true positives plus false positives). Recall (Rec) measures the model’s ability to identify true positives (true positives / [true positives + false negatives]). (F) Top nonzero regression coefficients of the patient-specific multiclass HSC classifier. (G) Top 20 nonzero regression coefficients identified by the patient-specific binary HSC classifier. (H) Top 20 nonzero regression coefficients identified by the patient-specific binary NK cell classifier. Statistical tests for the ML pipelines are described in the supplemental Methods.
Supplemental Figures 4 and 5 and supplemental Tables 14-16 are linked with data shown in Figure 2.
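The ACC, Pre, and Rec definitions used in this legend can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the confusion-matrix layout (rows as true groups, columns as predicted groups), and the toy cell counts are assumptions made for illustration only and do not come from the figure.

```python
# Illustrative sketch only; names, matrix layout, and counts are hypothetical.
# Assumed layout: cm[i][j] = number of cells of true group i predicted as group j.

def accuracy(cm):
    """ACC: correctly classified cells (diagonal) divided by all cells."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def precision_recall(cm, k):
    """Pre and Rec for group k, treating k as the positive class."""
    tp = cm[k][k]
    predicted_pos = sum(row[k] for row in cm)  # TP + FP (column sum)
    actual_pos = sum(cm[k])                    # TP + FN (row sum)
    precision = tp / predicted_pos if predicted_pos else float("nan")
    recall = tp / actual_pos if actual_pos else float("nan")
    return precision, recall

def chance_accuracy(n_classes):
    """Expected ACC of uniform random guessing over n_classes groups."""
    return 1 / n_classes

# Hypothetical binary example (index 0 = not C [AB], index 1 = C).
toy_cm = [
    [40, 10],  # true not C: 40 predicted not C, 10 predicted C
    [5, 45],   # true C:      5 predicted not C, 45 predicted C
]

acc = accuracy(toy_cm)
pre, rec = precision_recall(toy_cm, 1)
print(f"ACC={acc:.2f} Pre={pre:.2f} Rec={rec:.2f}")
print(f"chance: multiclass={chance_accuracy(3):.2f}, binary={chance_accuracy(2):.2f}")
```

With 3 groups, `chance_accuracy(3)` reproduces the 0.33 random baseline quoted for the multiclass classifier, and `chance_accuracy(2)` the 0.5 baseline for the binary classifier.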