Background Machine learning (ML) is widely promoted for hospital risk prediction, yet its incremental value over well-specified statistical models on structured data remains uncertain. We performed a comprehensive, head-to-head comparison of eight modern ML learners against a survey-weighted, ridge-penalized logistic baseline in a national cohort of chronic lymphocytic leukemia/small lymphocytic lymphoma (CLL/SLL) admissions.

Methods We extracted 117,765 adult CLL/SLL hospitalizations from the 2016-2022 U.S. National Inpatient Sample. Fifty-nine binary ICD-10 features (52 chronic Elixhauser/Charlson comorbidities, 7 acute complications) served as inputs for predicting in-hospital death with: (i) ridge logistic regression (baseline); and (ii) elastic-net, LightGBM, XGBoost, random forest (ranger), multilayer perceptron (MLP), multivariate adaptive regression splines (MARS), k-nearest-neighbors (k-NN), and radial-kernel support-vector machine (SVM). Hyperparameters were optimized by Bayesian search (20 iterations, five-fold cross-validation) with survey weights preserved throughout. Models were trained on 2016-2020 data and temporally validated on 2021-2022 data. We reported discrimination (1,000-resample bootstrap AUROC with DeLong tests), calibration (intercept, slope, Brier score), operating statistics at the Youden cut-point, decision-curve net benefit (5-25% thresholds), and equity across 23 age, sex, race, payer, and region strata. Shapley additive explanations (SHAP) profiled feature importance for LightGBM.
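As a minimal illustrative sketch (not the authors' actual pipeline), the survey-weighted ridge baseline can be approximated in Python with scikit-learn by passing discharge weights as observation weights; the grid search here stands in for the Bayesian search described above, and the data, labels, and weights are synthetic placeholders. A full NIS analysis would also require complex-survey variance estimation.

# Sketch: survey-weighted ridge-logistic baseline (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
n, p = 5000, 59                          # 59 binary ICD-10 features
X = rng.integers(0, 2, size=(n, p)).astype(float)
y = rng.integers(0, 2, size=n)           # toy in-hospital death labels
w = rng.uniform(1.0, 10.0, size=n)       # toy discharge weights

ridge = LogisticRegression(penalty="l2", solver="lbfgs", max_iter=1000)
search = GridSearchCV(ridge, {"C": np.logspace(-3, 2, 10)},
                      cv=5, scoring="roc_auc")
search.fit(X, y, sample_weight=w)        # weights propagate to each CV fit
print(search.best_params_, round(float(search.best_score_), 3))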

Results Crude in-hospital mortality was 5.8%. Ridge achieved an AUROC of 0.851; LightGBM modestly improved this to 0.854 (Δ = 0.003; p < 0.01). XGBoost and ranger registered 0.842 and 0.841, whereas elastic-net matched ridge at 0.851; MLP, MARS, k-NN, and SVM ranged from 0.726 to 0.849. Calibration slopes clustered near 1.0 (ridge 0.99, LightGBM 0.95), with Brier-score differences < 0.002 across models.
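To make the discrimination comparison concrete, a paired bootstrap over held-out predictions is a common companion to the DeLong test; in this hypothetical sketch, y, p_ridge, and p_lgbm stand for validation-set outcomes and the two models' predicted risks, none of which are provided in the abstract.

# Sketch: paired-bootstrap ΔAUROC (LightGBM minus ridge) with 95% CI.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_delta_auc(y, p_ridge, p_lgbm, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    deltas = []
    for _ in range(n_boot):
        idx = rng.choice(len(y), size=len(y), replace=True)
        if np.unique(y[idx]).size < 2:   # resample must contain both classes
            continue
        deltas.append(roc_auc_score(y[idx], p_lgbm[idx])
                      - roc_auc_score(y[idx], p_ridge[idx]))
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return float(np.mean(deltas)), (lo, hi)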

At the Youden threshold, ridge flagged 8.4% of admissions (sensitivity 0.634, specificity 0.882, PPV 24.2%); LightGBM raised sensitivity to 0.659 but generated 14% more alerts and lowered PPV to 22.6%. Decision-curve analysis converted that trade-off into ≤ 0.004 incremental net benefit (fewer than half an additional death detected per 100 admissions) across 5-25% thresholds.
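The operating-point and net-benefit quantities above follow standard definitions; below is a minimal sketch under the assumption that y holds 0/1 outcomes and p holds predicted risks (illustrative names, not from the paper).

# Sketch: Youden cut-point and decision-curve net benefit.
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y, p):
    fpr, tpr, thr = roc_curve(y, p)
    return thr[np.argmax(tpr - fpr)]   # maximizes sensitivity + specificity - 1

def net_benefit(y, p, pt):
    # Net benefit at risk threshold pt: TP/n - (FP/n) * pt / (1 - pt)
    flag = p >= pt
    n = len(y)
    tp = np.sum(flag & (y == 1))
    fp = np.sum(flag & (y == 0))
    return tp / n - (fp / n) * pt / (1 - pt)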

Performance was stable across demographics: the mean inter-model ΔAUROC was 0.006, and only the Native American stratum showed a Δ > 0.02. SHAP confirmed biological face validity: acute respiratory failure, sepsis, acute kidney injury, pneumonia, and tumor-lysis syndrome drove risk in LightGBM, mirroring the ridge coefficients and revealing no hidden proxies. Runtime and memory footprints scaled with algorithmic complexity; ridge completed training in 11 s on a standard laptop versus 3.4 min for LightGBM and more than 15 min for SVM.
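The SHAP step can be reproduced with the shap package's tree explainer; here, model and X_valid are assumed to be the fitted LightGBM model and the 2021-2022 validation features (neither is provided in the abstract), and some shap versions return one attribution array per class for binary classifiers.

# Sketch: global SHAP importance for the LightGBM model.
import shap

explainer = shap.TreeExplainer(model)          # model: fitted LightGBM
shap_values = explainer.shap_values(X_valid)   # per-feature attributions
shap.summary_plot(shap_values, X_valid)        # ranks drivers of risk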

Conclusions Across 117,765 nationally representative CLL/SLL admissions, a transparent ridge-penalized logistic model equaled or exceeded six of eight contemporary ML algorithms and trailed the best performer by just 0.003 AUROC. LightGBM's marginal discrimination gain translated into minimal clinical benefit while imposing greater computational cost, lower precision, and a higher alert burden. Given the convergent feature hierarchies, near-identical calibration, and robust subgroup equity, a parsimony-first deployment strategy (a ridge or grouped-LASSO score in routine care, with ML pipelines reserved for settings that can justify the infrastructure and explainability overhead) aligns with emerging WHO/FDA guidance on trustworthy AI. Prospective, multi-site impact studies should test whether such transparent tools improve triage decisions and patient outcomes while maintaining algorithmic fairness.
