While the use of National Comprehensive Cancer Network (NCCN) guidelines is essential in lymphoma care, adherence among oncologists remains highly variable. Real-world concordance ranges from 65–80% overall, dropping to as low as 11–46% for certain lymphoma subtypes. Although 88% of clinicians report using NCCN guidelines for Hodgkin lymphoma, adherence to PET-adapted approaches is inconsistent. This gap highlights a critical need for tools that support guideline-concordant care. Large language models (LLMs) such as GPT-4 show promise for clinical decision support, yet hallucinations and divergence from evidence-based care remain obstacles. We assessed whether retrieval-augmented generation (RAG) grounded in NCCN guidelines improves GPT-4 performance on lymphoma treatment recommendation tasks.

To address this, we developed a RAG-enhanced GPT-4 agent (RAG-GPT) by indexing the 2025 NCCN Lymphoma Guidelines into a vector database. Fifty clinical vignettes spanning diverse lymphoma subtypes were compiled from published case reports; this sample provides 80% power (α = 0.05) to detect a mean score difference of ≥0.18 on the modified Generative AI Performance Score (mG-PS). Baseline GPT-4 and RAG-GPT each generated treatment plans from standardized prompts. An oncologist scored all outputs with the mG-PS, which rates NCCN concordance (0.0–1.0) and penalizes hallucinations (-0.25 to -1.0); scores were normalized to a range of -1.0 to 1.0 for consistent evaluation. Guideline concordance was assessed against NCCN guidelines and clinical judgment reflecting real-world oncological decision-making, and readability and rationale clarity were rated on a 5-point Likert scale.
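As a rough illustration of the approach, the sketch below indexes guideline text into a vector store and grounds GPT-4's recommendation in the retrieved passages. It assumes a ChromaDB collection and the OpenAI chat API; the collection name, placeholder passages, chunking, and prompt wording are hypothetical and do not reproduce the study's actual pipeline.

```python
# Minimal retrieval-augmented generation sketch (assumptions: ChromaDB vector
# store, OpenAI chat API). Passage text and prompt wording are placeholders.
import chromadb
from openai import OpenAI

client = chromadb.Client()
collection = client.create_collection(name="nccn_lymphoma_2025")  # hypothetical index name

# Index guideline passages (toy strings standing in for the 2025 NCCN text).
guideline_chunks = [
    "Example passage 1 from the indexed guideline document.",
    "Example passage 2 from the indexed guideline document.",
]
collection.add(
    documents=guideline_chunks,
    ids=[f"chunk-{i}" for i in range(len(guideline_chunks))],
)

def rag_treatment_plan(vignette: str, k: int = 5) -> str:
    """Retrieve the most relevant guideline passages and ground GPT-4's plan in them."""
    hits = collection.query(query_texts=[vignette], n_results=min(k, collection.count()))
    context = "\n\n".join(hits["documents"][0])
    prompt = (
        "Using only the guideline excerpts below, recommend a treatment plan "
        f"for this lymphoma vignette.\n\nExcerpts:\n{context}\n\nVignette:\n{vignette}"
    )
    response = OpenAI().chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```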

RAG-GPT achieved a higher mean mG-PS (0.32) than baseline GPT-4 (0.10) (t(98) = -2.11, p = 0.038; mean score difference ΔmG-PS = 0.22). Importantly, RAG-GPT eliminated severe hallucinations, reinforcing confidence in the safety of its treatment recommendations. Fully concordant “gold-standard” plans were produced in 46% of vignettes with RAG-GPT versus 22% with baseline GPT-4. Readability scores modestly favored RAG-GPT but did not reach statistical significance.
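The comparison above can be reproduced in form (not with the study data) by an independent-samples t-test; the reported t(98) implies two arms of 50 scores each. The arrays below are synthetic placeholders centered on the reported means, used only to show the calculation.

```python
# Sketch of the mG-PS comparison: independent-samples t-test over two arms of
# 50 vignette scores (df = 98). Scores are synthetic placeholders, not study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
baseline_scores = rng.normal(0.10, 0.5, 50).clip(-1.0, 1.0)  # placeholder baseline GPT-4 mG-PS
rag_scores = rng.normal(0.32, 0.5, 50).clip(-1.0, 1.0)       # placeholder RAG-GPT mG-PS

t_stat, p_value = stats.ttest_ind(baseline_scores, rag_scores)
delta = rag_scores.mean() - baseline_scores.mean()
print(f"t(98) = {t_stat:.2f}, p = {p_value:.3f}, delta mG-PS = {delta:.2f}")
```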

Lymphoma's bimodal age distribution, heterogeneity, and lack of standardized treatment algorithms, together with regimen toxicity and the need to account for patient frailty and comorbidities, create complex clinical scenarios that constrain LLM applicability for nuanced decision-making. Non-Hodgkin lymphoma, particularly relapsed/refractory disease, transformed disease, and rare subtypes, often lacks a defined standard of care, leading reviewers to rely on clinical judgment or published case series beyond NCCN guidelines. This underscores a key limitation of LLM-supported tools: their performance depends heavily on the presence of clear algorithmic consensus for a given disease. Linking GPT-4 to NCCN guidelines via RAG significantly improved treatment accuracy and reduced hallucination severity in lymphoma, although performance remained lower than in prior RAG studies in cancers such as breast and CNS malignancies, suggesting a greater impact on hallucination reduction than on retrieval of a consistent gold-standard regimen. These findings support continued development and prospective validation of guideline-anchored LLMs for clinical use.
