Abstract
Background: Therapeutic decision-making in hematology, particularly for patients with multiple myeloma (MM), is characterized by high variability owing to complex patient profiles and the multiplicity of therapeutic options. The absence of standardized selection criteria for multi-line treatment leads to inconsistent and poorly reproducible clinical decisions. Artificial intelligence (AI)-based clinical decision support systems (CDSS) have the potential to improve the consistency and effectiveness of therapy selection by integrating evidence-based rules and probabilistic outcome models. However, their concordance with expert clinical reasoning and the clinical relevance of their recommendations require further evaluation. Aims: This study aimed to (1) assess inter-expert agreement in selecting treatment regimens for MM clinical cases, (2) compare expert recommendations with those of an AI-based CDSS (M-BOT), and (3) evaluate the association between AI concordance and predicted probabilities of treatment response.
Methods: Five board-certified hematologists independently evaluated 37 clinical MM cases and selected up to three therapeutic regimens per patient. Inter-expert agreement was quantified using Krippendorff's alpha (α) for multi-choice selections and pairwise Jaccard indices for overlap in selected regimens. Expert decisions were then compared with M-BOT, which generates therapy recommendations from a curated database of clinical trials and probabilistic modeling of ≥VGPR (very good partial response or better) response rates. We analyzed exact matches (identical regimen sets) and partial overlaps (at least one shared regimen); a total of 383 unique patient–regimen combinations were analyzed. Predicted response probabilities were compared between expert- and AI-selected regimens using the exact two-sample Brown–Mood median test.
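The pairwise overlap metric used here is the standard Jaccard index, |A ∩ B| / |A ∪ B|, computed for every expert pair on each case. A minimal sketch (expert labels and regimen names are illustrative placeholders, not data from the study):

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| for two sets of regimens."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def pairwise_overlap(selections):
    """Per-pair Jaccard indices across experts for one clinical case.

    `selections` maps an expert label to that expert's chosen regimens
    (up to three per patient, as in the study design).
    """
    return {
        (e1, e2): jaccard(selections[e1], selections[e2])
        for e1, e2 in combinations(sorted(selections), 2)
    }

# Hypothetical single-case example:
case = {
    "expert_A": {"VRd", "DRd"},
    "expert_B": {"VRd", "KRd", "DRd"},
    "expert_C": {"IsaPd"},
}
```

Here `pairwise_overlap(case)` would give A–B an index of 2/3 (two shared regimens out of three distinct ones) and A–C an index of 0, mirroring how the reported 0.171–0.526 range summarizes overlap across all expert pairs.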
Results: Overall inter-expert agreement was extremely low (Krippendorff's α = 0.0024), indicating near-random concordance. Exact matches between expert pairs occurred in only 0–2.7% of cases, except for one pair with 37.8% matches. Pairwise Jaccard indices ranged from 0.171 to 0.526, reflecting low-to-moderate overlap. Experts revised 30.8% of their initial therapy choices (57/185) after reviewing AI recommendations, with individual revision rates ranging from 8.1% to 51.4%. Comparison with AI revealed no exact matches across all cases, but partial overlaps were frequent (51–73%), indicating that experts and AI often identified similar components but assembled them differently. Group-level α (experts + AI) was 0.038, underscoring minimal consensus. Importantly, M-BOT regimens showed significantly higher predicted response probabilities than expert selections (median 67.8% [IQR 51.1–81.9] vs. 50.0% [IQR 43.0–69.7], p < 0.0001). Regimens selected by both experts and AI had higher predicted efficacy (median 66.9%) than expert-only regimens (46.0%). Line-specific analysis showed consistently higher AI predictions across all treatment lines: first-line (80.0% vs. 70.0%), second-line (67.8% vs. 60.2%), and ≥ third-line therapy (64.8% vs. 46.4%; all p < 0.05).
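The median comparison reported above uses the exact two-sample Brown–Mood test: each observation is classified as above or not above the grand median of the pooled samples, and a Fisher-style exact p-value is computed on the resulting 2×2 table. A minimal pure-Python sketch (ties at the grand median are counted as "not above", a simplification; the sample values are illustrative, not study data):

```python
from math import comb

def brown_mood_exact(x, y):
    """Exact two-sample Brown–Mood median test.

    Counts values strictly above the grand median in each sample and
    computes a two-sided exact p-value from the hypergeometric
    distribution of the resulting 2x2 table.
    Returns (grand_median, p_value).
    """
    pooled = sorted(list(x) + list(y))
    n = len(pooled)
    grand_median = (pooled[(n - 1) // 2] + pooled[n // 2]) / 2

    a = sum(v > grand_median for v in x)  # x-values above the median
    b = sum(v > grand_median for v in y)  # y-values above the median
    m, k, N = len(x), a + b, len(x) + len(y)

    def p_hyper(t):
        # Probability that t of the k "above-median" observations fall in x
        return comb(m, t) * comb(N - m, k - t) / comb(N, k)

    p_obs = p_hyper(a)
    lo, hi = max(0, k - (N - m)), min(k, m)
    # Two-sided p: sum over all tables no more probable than the observed one
    p = sum(p_hyper(t) for t in range(lo, hi + 1) if p_hyper(t) <= p_obs + 1e-12)
    return grand_median, p

# Fully separated samples give the smallest attainable p for this n:
median, p = brown_mood_exact([1, 2, 3, 4], [5, 6, 7, 8])
```

For production use, `scipy.stats.median_test` implements Mood's median test with configurable tie handling; the hand-rolled version above is only meant to make the mechanics of the table construction explicit.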
Conclusions: Expert treatment selection for multiple myeloma shows high heterogeneity and minimal consensus. M-BOT produces partially overlapping but distinct treatment sets with higher predicted probabilities of response, particularly in later lines of therapy. The AI tool also influenced clinical reasoning: nearly one-third of expert decisions were revised after AI review. These findings suggest that AI-based CDSS may serve as a valuable second-opinion tool, improving consistency and guiding the selection of regimens with higher expected efficacy. The M-BOT tool is available at oncotriage.com.