Introduction
Graft-versus-host disease (GVHD) is a significant complication of hematopoietic stem cell transplantation, with an associated mortality of 10-30%. Staging and grading are often performed by physicians in training, such as residents and fellows, and assessments can be inconsistent between physicians and between clinical centers. This inconsistency may cause confusion for both the medical team and patients. This study evaluates how well an artificial intelligence (AI) tool determines the grade of acute GVHD and explores its potential role in GVHD classification.
Methods
We assessed the performance of Chat-GPT-3.5 and Chat-GPT-4.0 in grading acute GVHD. We constructed a dataset of simulated patients with acute GVHD grades 1-4 and asked both Chat-GPT-3.5 and Chat-GPT-4.0 to state the severity of GVHD in each patient. The tools' answers were compared with physician assessment, which was based on the modified Glucksberg criteria.
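To make concrete the kind of mapping the tools were asked to perform, the following is a minimal Python sketch of organ staging and overall grading. The thresholds are an assumption: they follow the modified Glucksberg (consensus) table as it is commonly reproduced, not a table given in this abstract, and they should be checked against the reference in local use. All function and parameter names are hypothetical.

```python
# Minimal sketch of modified Glucksberg staging and grading.
# ASSUMPTION: thresholds follow the commonly tabulated consensus criteria;
# they are not taken from this abstract. All names are hypothetical.

def skin_stage(bsa_percent: float, bullae_or_desquamation: bool = False) -> int:
    """Skin stage from rash extent, given as % body surface area (BSA)."""
    if bullae_or_desquamation:
        return 4
    if bsa_percent <= 0:
        return 0
    if bsa_percent < 25:
        return 1
    if bsa_percent <= 50:
        return 2
    return 3


def liver_stage(bilirubin_mg_dl: float) -> int:
    """Liver stage from total bilirubin in mg/dL."""
    if bilirubin_mg_dl < 2:
        return 0
    if bilirubin_mg_dl <= 3:
        return 1
    if bilirubin_mg_dl <= 6:
        return 2
    if bilirubin_mg_dl <= 15:
        return 3
    return 4


def gut_stage(stool_ml_per_day: float, severe_pain_or_ileus: bool = False) -> int:
    """Gut stage from stool volume in mL/day (a volume, not an episode count)."""
    if severe_pain_or_ileus:
        return 4
    if stool_ml_per_day < 500:
        return 0
    if stool_ml_per_day < 1000:
        return 1
    if stool_ml_per_day <= 1500:
        return 2
    return 3


def overall_grade(skin: int, liver: int, gut: int) -> int:
    """Combine organ stages into an overall grade (0 means no acute GVHD)."""
    if skin == 4 or liver == 4:
        return 4
    if liver >= 2 or gut >= 2:
        return 3
    if skin == 3 or liver == 1 or gut == 1:
        return 2
    if skin >= 1:
        return 1
    return 0


# Example: rash over 30% BSA, bilirubin 1.0 mg/dL, 1200 mL/day of diarrhea.
grade = overall_grade(skin_stage(30), liver_stage(1.0), gut_stage(1200))
print(grade)  # 3
```

The sketch takes stool volume and BSA as numbers, so any description in terms of diarrhea episodes or named body regions would have to be converted to a volume or percentage before staging.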
Results
Both Chat-GPT tools used the Glucksberg criteria as their reference source. Overall, Chat-GPT-4.0 demonstrated 75% concordance with physician assessment, whereas Chat-GPT-3.5 showed 62.5% concordance. When provided with numerical data, both tools graded acute GVHD with 100% accuracy. When given subjective data, however, Chat-GPT-4.0 exhibited 50% concordance with physician assessment and Chat-GPT-3.5 showed 25%. Although Chat-GPT was able to provide a stage for each involved organ, it struggled to estimate the volume of diarrhea from the number of episodes. For instance, when asked to grade a patient experiencing more than 20 episodes of diarrhea per day, the tool concluded that the patient had grade 4 acute GVHD. This suggests that Chat-GPT-4.0 has difficulty accurately assessing the amount of diarrhea unless the number of episodes is very large.
Chat-GPT-3.5 had difficulty quantifying the body surface area (BSA) involved. It was unable to assess the BSA when it was less than 25%. Moreover, if the rash involved more than two areas of the body, even when these were described with alternative language (e.g., ‘torso’, ‘entire torso’), it disproportionately inflated the BSA.
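The concordance figures reported above are percentages of cases in which the tool's grade matched the physician's grade. A minimal Python sketch of that calculation follows; the function name and the example grades are hypothetical and are not the study's data.

```python
# Minimal sketch of percent concordance between two sets of grades.
# ASSUMPTION: concordance means exact agreement on the overall grade.

def concordance(model_grades, physician_grades):
    """Return the percentage of cases where both grades agree exactly."""
    if len(model_grades) != len(physician_grades):
        raise ValueError("grade lists must have the same length")
    if not physician_grades:
        raise ValueError("at least one case is required")
    matches = sum(m == p for m, p in zip(model_grades, physician_grades))
    return 100.0 * matches / len(physician_grades)


# Hypothetical grades for illustration only (not the study's dataset).
physician = [1, 2, 3, 4]
model = [1, 2, 4, 4]
print(concordance(model, physician))  # 75.0
```

An overall concordance equals the simple mean of the numerical-data and subjective-data concordances only when the two subsets contain the same number of cases. The reported overall values are consistent with equally sized subsets, although the abstract does not state the case counts.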
Conclusion
These results demonstrate that both tools are insufficient for accurately grading acute GVHD when only subjective data are provided, primarily because Chat-GPT-3.5 could not reliably assess the body surface area (BSA) and Chat-GPT-4.0 could not reliably estimate the volume of diarrhea. However, both tools displayed 100% accuracy when objective data were provided, and Chat-GPT-4.0 was notably capable of accurately indicating the stages of organ involvement. Healthcare professionals should not rely solely on Chat-GPT for assessment, and patients should be advised against using Chat-GPT to evaluate their own acute GVHD.
Disclosures
No relevant conflicts of interest to declare.