Introduction: The promise of real-world data in advancing clinical research for patients with hematologic malignancies is often undermined by limitations in data collection. Although electronic health records (EHRs) are widely adopted, much of the information they contain is unstructured. For instance, data fundamental to leukemia clinical research, such as bone marrow pathology reports, remain largely unoptimized for automated extraction and are thus inaccessible to large-scale analyses. Large language models (LLMs) may offer a solution by enabling efficient extraction of large amounts of data. However, their applicability in clinical research remains unclear. Hence, we sought to use LLMs to systematically extract information from bone marrow biopsy pathology reports.

Methods: We collected full-text pathology reports from bone marrow biopsies performed to evaluate new-onset pancytopenia at Yale-New Haven Hospital. Data extraction was performed on unprocessed text using OpenAI Generative Pre-Trained Transformer 4.1 (gpt-4.1, version 2025-04-14) and gpt-o3 (version 2025-04-16) in a private, HIPAA-compliant environment. Both models were used in their standard versions, without fine-tuning and with a zero-shot prompting strategy. The extracted fields were Medical Record Number (MRN) and final diagnosis; biopsy-derived sample quality, cellularity, fibrosis grading and blast percentage; aspirate-derived sample quality, presence of dysplasia, ring sideroblasts and blast count; and flow cytometry-derived sample quality and blast count. Model performance was compared against a reference dataset annotated manually through expert hematologist review. Accuracy, Agresti-Coull adjusted 95% confidence intervals (CI), hallucination rate and omission rate were computed for each variable. Categorical variables were assessed using Cohen's kappa (k), while numerical variables were evaluated using Spearman's correlation coefficient (r) and root mean square error (RMSE).
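As an illustration of the agreement measures named above, the following is a minimal sketch using their textbook definitions. It is not the authors' analysis code: the function names and the toy values are hypothetical, and Spearman's correlation is left out for brevity.

```python
import math
from collections import Counter


def agresti_coull(successes, n, z=1.96):
    """Agresti-Coull adjusted interval for a binomial proportion (z=1.96 for 95%)."""
    n_adj = n + z ** 2                       # adjusted sample size
    p_adj = (successes + z ** 2 / 2) / n_adj  # adjusted proportion
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


def cohen_kappa(reference, predicted):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    n = len(reference)
    observed = sum(r == p for r, p in zip(reference, predicted)) / n
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Agreement expected by chance from the marginal label frequencies.
    expected = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / n ** 2
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def rmse(reference, predicted):
    """Root mean square error between reference and extracted numeric values."""
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Toy values for illustration only; these are NOT study data.
ref_dysplasia = ["yes", "no", "no", "yes", "no", "no"]
llm_dysplasia = ["yes", "no", "yes", "yes", "no", "no"]
ref_blasts = [2.0, 5.0, 10.0, 0.0]
llm_blasts = [2.0, 5.0, 12.0, 0.0]

correct = sum(r == p for r, p in zip(ref_dysplasia, llm_dysplasia))
ci_low, ci_high = agresti_coull(correct, len(ref_dysplasia))
kappa = cohen_kappa(ref_dysplasia, llm_dysplasia)
blast_rmse = rmse(ref_blasts, llm_blasts)
print(f"accuracy={correct / len(ref_dysplasia):.2f}, CI=({ci_low:.2f}, {ci_high:.2f})")
print(f"kappa={kappa:.2f}, RMSE={blast_rmse:.2f}")
```

The Agresti-Coull adjustment adds z²/2 pseudo-successes and z² pseudo-trials before computing a Wald-type interval, which keeps the interval better behaved when accuracy is close to 100%.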

Results: This study included 376 pathology reports from unique patients. For gpt-4.1, overall accuracy was 96.7%, with an omission rate of 0.6% and a hallucination rate of 0.5%. For categorical variables, perfect extraction (accuracy 100%, k=1) was achieved for MRN and presence of fibrosis. Near-perfect accuracy (>95%) was achieved for sample quality of the trephine biopsy, aspirate and flow cytometry, and for presence of ring sideroblasts. Accuracy for final diagnosis was 91% (CI=0.88-0.94, k=0.94). Extraction accuracy for continuous variables was near-perfect (99%) for biopsy cellularity (CI=0.98-1, r=0.99, RMSE=2.1), fibrosis grading (CI=0.98-1, r=0.99, RMSE=0) and aspirate blast count (CI=0.98-1, r=1, RMSE=0). Extraction accuracy was slightly lower for biopsy blast count (95%, CI=0.92-0.97, r=0.99, RMSE=1.1) and flow cytometry blast count (97%, CI=0.94-0.98, r=0.99, RMSE=2.2) due to higher hallucination rates (3.2% and 0.8%, respectively). Assessment of dysplasia showed lower accuracy for all lineages: erythroid (94%, k=0.96), granulocytic (92%, k=0.93) and megakaryocytic (83%, k=0.67).

Overall accuracy for gpt-o3 was higher, at 97.4%, with omission and hallucination rates of 0.7% and 0.2%, respectively. For categorical variables, performance was slightly better for megakaryocytic dysplasia (accuracy 86%, k=0.8) and comparable for the remaining variables. For continuous variables, near-perfect accuracy (99%) was confirmed for cellularity (CI=0.98-1, r=1, RMSE=0), fibrosis grading (CI=0.98-0.99, r=1, RMSE=0) and aspirate blast count (CI=0.98-1, r=1, RMSE=0). A minor improvement was observed in accuracy for biopsy blast count (98%), owing to a lower hallucination rate (0%). Runtime for gpt-o3 on the full dataset was substantially longer, at 122 minutes compared with 12 minutes for gpt-4.1.

Conclusions: Although prior studies have explored automated extraction from pathology reports using expert systems, rule-based algorithms, and general LLM approaches, this is the first attempt to apply LLM-mediated extraction specifically within malignant hematology, which presents unique challenges compared with solid oncology. Our methodology achieved near-perfect accuracy in extracting key bone marrow biopsy data points. It can be scaled to very large datasets and adapted to diverse research settings with minimal code adjustments, while remaining HIPAA-compliant. The marginal improvements observed with the larger, reasoning gpt-o3 model suggest that smaller, less expensive models can achieve high accuracy with substantially shorter runtimes.
