Figure 1.
Schematic representation of the data collection and analysis. (A) Data from different sources, including baseline tests, routine laboratory tests, and recurrent mutations, were combined to construct a heterogeneous data set. The prediction point was set at 3 months postdiagnosis, and clinical outcomes were predicted. (B) The clinical outcomes were death, treatment, the combined event of treatment or infection (composite), and infection. (C) Based on the combination of feature sets, 4 models were defined: (1) IPI, which included the CLL-IPI score and the CLL-IPI features only; (2) +BL, which included CLL-IPI features, baseline tests, and routine laboratory tests; (3) +MUT, which included CLL-IPI features and recurrent mutations; and (4) ALL, which included all features. (D) Clinical outcomes were predicted over 2- and 5-year outlooks postdiagnosis, excluding the first 3 months. (E) The data from the different sources were merged to create one data set. Then, for a specific outcome and outlook, the target values were created and later used in training and testing. Feature extraction was performed according to the selected model. A stacked ML model consisting of 7 algorithms and a fusion stage based on majority voting was trained and tested. The performance of the models and the contribution of the features were estimated to identify the risk factors predictive of each combination of outcome, model, and outlook. tNGS, targeted next-generation sequencing.
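The fusion stage in panel E can be illustrated with a minimal sketch of majority voting over the binary predictions of 7 base models. This is only an illustration under stated assumptions, not the authors' implementation: the function names `majority_vote` and `fuse`, the 0/1 label encoding, and the toy predictions are all hypothetical, and the caption does not specify which 7 algorithms were used.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(votes: Sequence[int]) -> int:
    """Return the label chosen by the largest number of base models."""
    if not votes:
        raise ValueError("at least one vote is required")
    # most_common(1) returns [(label, count)] for the most frequent label.
    return Counter(votes).most_common(1)[0][0]


def fuse(per_model_predictions: Sequence[Sequence[int]]) -> List[int]:
    """Fuse predictions; per_model_predictions[m][i] is model m's label for patient i."""
    # zip(*...) regroups the votes so that each tuple holds all models' votes for one patient.
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]


# Hypothetical output of 7 base models for 3 patients (1 = event, 0 = no event).
base_predictions = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
]

fused = fuse(base_predictions)
print(fused)  # [1, 0, 1]
```

With an odd number of base models and a binary outcome, a vote can never tie, which is one practical reason to use 7 rather than an even number of voters.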