TO THE EDITOR:
Most physicians remember conducting chart review for research: spending hours clicking through clinical notes and reading blocks of text. Although many structured fields can be extracted from electronic health records, the complexity of a patient’s story is often found only in narrative text. Data science researchers have long worked to leverage natural language processing tools to extract information, such as venous thromboembolism events, from text, but performance remains inconsistent across institutions.1
Large language models (LLMs) have emerged as a newer natural language processing technology, showing good performance on tasks ranging from data extraction to sentiment analysis.2 Several LLMs have been trained on medical text to improve performance in the health care field.3,4 However, as larger LLMs are released, we hypothesize that even general-purpose LLMs can perform well on medical tasks. These new LLMs hold the promise of becoming tools that every investigator can use to accelerate clinical research without specific coding or training.
Ge et al5 previously assessed the performance of a general-purpose LLM in extracting information about hepatocellular carcinoma from radiology reports. They prompted the LLM to review reports and provide information such as the diameter of the largest lesion and the presence of macrovascular invasion. With no additional customization of the LLM, they demonstrated accuracy between 0.89 and 0.99. However, they used a relatively small data set of 1101 reports and the model Generative Pre-trained Transformer 4 (GPT-4), which can be used with protected health information only if the institution has a specific agreement in place, something that remains uncommon.
In this study, we experimented with using an open-source LLM to predict acute pulmonary embolism (PE) from a corpus of nearly 20 000 radiology reports. We used Llama-3-8B, a generative LLM released in April 2024 that can be downloaded by any researcher for local use with protected health information.
We used the Medical Information Mart for Intensive Care IV (MIMIC-IV) Extension Pulmonary Embolism data set, which includes 19 942 computed tomography pulmonary angiogram (CTPA) reports labeled by physician adjudication, 1591 of which have positive PE findings.6 We extracted the “Findings” and “Impression” sections of the reports using regular expressions to shorten the reports and meet the model’s token-length requirements; the code for this approach is available as part of the data set. If those sections were not identified in a report, the whole report was included. All data in MIMIC-IV have been previously deidentified, and the institutional review boards of the Massachusetts Institute of Technology (number 0403000206) and Beth Israel Deaconess Medical Center (number 2001-P-001699/14) both approved use of the database for research.
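As an illustration of this preprocessing step, a minimal sketch is shown below; the regular-expression patterns and the extract_sections helper are illustrative assumptions about typical CTPA report formatting, not the extraction code distributed with the data set.

```python
import re

# Minimal sketch of section extraction; the patterns and helper name are
# illustrative assumptions, not the code shipped with the data set.
FINDINGS_RE = re.compile(r"FINDINGS?:(.*?)(?=IMPRESSION:|$)", re.IGNORECASE | re.DOTALL)
IMPRESSION_RE = re.compile(r"IMPRESSION:(.*)$", re.IGNORECASE | re.DOTALL)

def extract_sections(report: str) -> str:
    """Return the Findings and Impression sections, or the whole report if neither is found."""
    findings = FINDINGS_RE.search(report)
    impression = IMPRESSION_RE.search(report)
    if not findings and not impression:
        return report.strip()  # fall back to the full report
    parts = []
    if findings:
        parts.append("FINDINGS:" + findings.group(1).strip())
    if impression:
        parts.append("IMPRESSION:" + impression.group(1).strip())
    return "\n".join(parts)
```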
We downloaded Llama-3-8B from Hugging Face and ran the model on an Nvidia A6000 graphics processing unit. We first experimented with different ways of interacting with the model, a process known as prompt engineering, on a subset of 1000 reports (500 PE positive and 500 PE negative), evaluating whether different prompting strategies led to better performance (Table 1). We began with a zero-shot approach and then added labeled examples in an in-context learning approach (supplemental Table 1). Performance on this subset was reported as sensitivity and accuracy. The best-performing prompt was then applied to the full corpus of 19 942 CTPA reports, with performance reported as sensitivity (recall) and positive predictive value (PPV, or precision). Accuracy was not used for the full corpus because it can overestimate model performance on an imbalanced data set.
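For readers who want to reproduce this kind of workflow, a minimal sketch of loading the model and issuing the shortest zero-shot prompt is shown below. It assumes the transformers library, the instruction-tuned meta-llama/Meta-Llama-3-8B-Instruct checkpoint, and a simple parsing rule for the final “Label:” line; the generation settings are illustrative rather than the exact configuration used in this study.

```python
import torch
from transformers import pipeline

# Minimal sketch, assuming transformers with chat-template support and access
# to the Meta-Llama-3-8B-Instruct weights; settings are illustrative only.
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # a single A6000 (48 GB) holds the 8B model in bf16
)

ZERO_SHOT_PROMPT = (
    "Act as an expert radiologist and identify reports that describe acute "
    "pulmonary embolism as positive. Output 'positive' or 'negative' at the "
    "end of your response, in a single line that starts with 'Label:'"
)

def classify(report_text: str) -> str:
    """Classify one report as 'positive' or 'negative' for acute PE."""
    messages = [
        {"role": "system", "content": ZERO_SHOT_PROMPT},
        {"role": "user", "content": report_text},
    ]
    output = generator(messages, max_new_tokens=256, do_sample=False)
    reply = output[0]["generated_text"][-1]["content"]
    # Parse the last line that starts with 'Label:'; default to negative otherwise.
    for line in reversed(reply.splitlines()):
        if line.strip().lower().startswith("label:"):
            return "positive" if "positive" in line.lower() else "negative"
    return "negative"
```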
Table 1. Prompt engineering with 1000 labeled examples

| Prompting approach | Prompt or expected model output |
| --- | --- |
| Prompt only with no labeled examples (zero-shot) | |
| Prompt 1 (shortest prompt): ask the model to roleplay as a physician and classify reports directly | Act as an expert radiologist and identify reports that describe acute pulmonary embolism as positive. Output ‘positive’ or ‘negative’ at the end of your response, in a single line that starts with ‘Label:’ |
| Prompt 2 (longer prompt): ask the model to roleplay as a physician and classify reports directly, with additional direction | Please review the radiology report and identify which ones describe an acute pulmonary embolism. If there is an acute pulmonary embolism, label the report positive. If not, the report should be labeled negative. Output ‘positive’ or ‘negative’ at the end of your response, in a single line that starts with ‘Label:’ |
| Prompt 3 (longest prompt): ask the model to roleplay as a physician and classify reports via chain-of-thought prompting, with additional direction and examples | You are a physician reviewing a radiology report. Please isolate any sentences that describe pulmonary embolism. If the radiology report describes an acute pulmonary embolism, label the report as ‘positive.’ All other reports are labeled ‘negative.’ If there is no pulmonary embolism, chronic pulmonary embolism only, or the findings are equivocal, label the report as ‘negative.’ Other types of findings such as other thrombi or masses are labeled ‘negative.’ Output ‘positive’ or ‘negative’ at the very end of the response, in a single line that starts with ‘Label:’ |
| Best prompt from above with labeled examples (in-context learning) | |
| 4 labeled examples | |
| 5 labeled examples | |
| 6 labeled examples | |
| Best prompt from above with maximum number of labeled examples and explanation | |
| Label only | Label: negative |
| Label and relevant sentence(s) | Label: negative. Relevant sentence: “Chronic appearing pulmonary emboli involving the distal right main pulmonary artery, and multiple lobar, segmental and subsegmental branches of the bilateral upper and lower lobes.” |
| Label and relevant sentence(s) & explanation | Label: negative. Relevant sentence: “Chronic appearing pulmonary emboli involving the distal right main pulmonary artery, and multiple lobar, segmental and subsegmental branches of the bilateral upper and lower lobes.” Explanation: This report states that there are chronic pulmonary emboli only with no acute pulmonary emboli. We consider these cases negative. |
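To make the in-context learning arm concrete, the sketch below shows one way the labeled examples from Table 1 (label, relevant sentence, and explanation) could be supplied to the model as prior user/assistant turns before the report to be classified; the EXAMPLES structure, placeholder report text, and build_messages helper are assumptions for illustration, not the authors’ published code.

```python
# Illustrative sketch of the in-context learning setup: each labeled example is
# passed as a prior user/assistant exchange before the report to classify.
# The placeholder report text and helper names are assumptions for illustration.
EXAMPLES = [
    {
        "report": "<full text of a labeled CTPA report>",
        "answer": (
            "Label: negative\n"
            "Relevant sentence: \"Chronic appearing pulmonary emboli involving the "
            "distal right main pulmonary artery, and multiple lobar, segmental and "
            "subsegmental branches of the bilateral upper and lower lobes.\"\n"
            "Explanation: This report states that there are chronic pulmonary emboli "
            "only with no acute pulmonary emboli. We consider these cases negative."
        ),
    },
    # ...up to 6 labeled examples in the best-performing configuration...
]

def build_messages(system_prompt: str, report_text: str) -> list[dict]:
    """Assemble a chat prompt with few-shot examples followed by the target report."""
    messages = [{"role": "system", "content": system_prompt}]
    for example in EXAMPLES:
        messages.append({"role": "user", "content": example["report"]})
        messages.append({"role": "assistant", "content": example["answer"]})
    messages.append({"role": "user", "content": report_text})
    return messages
```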
In our experiment with 1000 reports, Llama-3-8B demonstrated sensitivity near 100% even with zero-shot approaches. The best-performing configuration was the longest prompt with the greatest number of labeled examples, each including the label, the relevant sentence, and an explanation (supplemental Table 2).
When this best-performing approach was applied to the full corpus, it took ∼90 minutes for Llama-3-8B to review the reports. We achieved a sensitivity of 98.4%, with 1567 of 1591 positive reports accurately identified (Table 2). However, PPV was only 48.2%. An analysis of the 1386 false positives showed that in most cases (72.2%), Llama-3-8B isolated sentences not related to PE; these sentences most commonly described another type of thrombus. In 170 of 1386 cases, Llama-3-8B isolated sentences related to PE but the final assessment was incorrect; most of these cases (129/170; 75.9%) were reports that described chronic or equivocal findings. In 175 of 1386 cases, Llama-3-8B isolated sentences that described acute PE, but further context from the report revealed a negative finding. Overall, of the 345 chronic PE and 104 equivocal findings in the original data set, 132 (38.3%) and 39 (37.5%), respectively, were included in the false-positive predictions.
Table 2. Performance of Llama-3-8B on the full corpus of 19 942 radiology reports

| | Positive PE | Negative PE |
| --- | --- | --- |
| Llama prediction of positive | 1567 | 1386 |
| Llama prediction of negative | 24 | 16 965 |
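For reference, both reported metrics follow directly from the counts in a 2 × 2 table like Table 2; the short sketch below states the definitions with generic counts (not tied to any particular published value) and notes why accuracy was avoided on the full, imbalanced corpus.

```python
def sensitivity_and_ppv(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Sensitivity (recall) and positive predictive value (precision) from
    confusion-matrix counts. Accuracy is deliberately omitted: with a large
    number of true negatives in an imbalanced corpus, it overstates performance."""
    sensitivity = tp / (tp + fn)  # fraction of true PE reports that were flagged
    ppv = tp / (tp + fp)          # fraction of flagged reports that were true PE
    return sensitivity, ppv
```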
LLMs have received increasing attention in the health care field. However, many studies focus on the use of proprietary models such as GPT. Although these models are the largest and most capable of their kind, they cannot be applied to protected health data without specific institutional agreements, including a business associate agreement, in place. This is because the data must be sent outside the hospital to another company’s servers for processing. Using GPT on protected health data therefore remains out of reach for many investigators, limiting the potential of LLMs.
In this study, we sought to demonstrate how investigators can use an open-source model to accelerate their own clinical research. Open-source models can be downloaded locally, allowing researchers to run the model and process all data on their own institution’s servers. We applied Llama-3-8B to nearly 20 000 radiology reports and showed that, with no coding to train or customize the model, Llama-3-8B was able to rapidly label which reports describe acute PE with a sensitivity of 98.4%. Llama-3-8B isolated the correct relevant sentences in the report in most cases, easing the burden of chart review for those who would prefer to manually adjudicate abbreviated reports. PPV was lower, at 48.2%, although this was to be expected even with a high-performing model given the low prevalence of PE in our data set (8%). Our work shows how open-source LLMs, used without task-specific training, can act as screening tools to lessen the burden of chart review. In this experiment, Llama-3-8B effectively reduced the number of reports to review from ∼20 000 to ∼3000 (the number of reports predicted to be positive) in 90 minutes.
Further prompt engineering may improve performance. Our experiments showed how performance can vary depending on the inputs to the model. We asked the model to roleplay as a physician and found that performance improved as the prompts increased in length, with more task specification, more examples, and chain-of-thought prompting, consistent with prior research in this field.7,8 Prompts can be customized to the task at hand and tested on a smaller labeled data set before being applied to the full cohort, as sketched below.
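A minimal sketch of that workflow, assuming a classification helper such as the classify() sketch above is passed in as a callable, might look like the following; the function and variable names are illustrative, not the authors’ published code.

```python
from typing import Callable

# Illustrative harness: score one prompt configuration on a small labeled
# subset before committing to the full corpus. classify_fn stands in for a
# model call such as the classify() sketch above (an assumption).
def evaluate_prompt(
    classify_fn: Callable[[str], str],
    reports: list[str],
    labels: list[str],
) -> dict[str, float]:
    predictions = [classify_fn(report) for report in reports]
    tp = sum(p == y == "positive" for p, y in zip(predictions, labels))
    fn = sum(p == "negative" and y == "positive" for p, y in zip(predictions, labels))
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # as reported on the balanced subset
        "accuracy": correct / len(labels) if labels else 0.0,
    }
```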
Better performance may also be achieved with larger models. We used Llama-3-8B, which has 8 billion parameters, because it was the newest open-source model that could be run with minimal computational resources. Larger models are available if researchers have access to more computing power. Proprietary models tend to be larger (GPT-3 has a reported 175 billion parameters), and larger models often demonstrate improved abilities.2 As better models become open source, investigators can leverage them for more complex tasks. We selected just 1 task (identifying acute PE) because of our access to gold-standard labels for a large data set. In the future, investigators can experiment with using LLMs to extract more subtle information from notes.
More than one-third of the reports labeled as chronic or equivocal in the original data set were identified as positive by Llama-3-8B, illustrating the challenge of interpreting subjective language. LLMs can only be as good as the clarity of the language we use. Continuing to improve the quality and standardization of the data in the electronic health record will allow investigators, and ultimately clinicians and patients, to reap the full benefits of machine learning.
Contribution: B.D.L. and P.W. conceived and designed the study; B.D.L. and I.K. collected data; B.D.L., P.W., S.M., O.J., and A.L. analyzed and interpreted the data; B.D.L. prepared the manuscript draft; and all authors contributed to manuscript revision and approved the final version of the manuscript.
Conflict-of-interest disclosure: P.W. completed a summer internship with Meta. The conception and completion of this study took place before knowledge of the internship. The remaining authors declare no competing financial interests.
The current affiliation for B.D.L. is Division of Hematology and Oncology, Department of Medicine, Fred Hutchinson Cancer Center, University of Washington Medical Center, Seattle, WA.
Correspondence: Barbara D. Lam, Division of Hematology and Oncology, Department of Medicine, Fred Hutchinson Cancer Center, University of Washington Medical Center, 1144 Eastlake Ave E, Seattle, WA 98109; email: blam1@uw.edu.
References
Author notes
Data are available at: https://github.com/barbaralam/cxrpe.
The full-text version of this article contains a data supplement.