Abstract P005

Accuracy of Ann Arbor Stage Assignment Based on PET/CT Reports Using a Large Language Model in Hodgkin Lymphoma Patients

Aim: Large language models (LLMs) have recently shown remarkable performance in solving tasks across various fields. Growing evidence suggests that they might be useful for patient self-education and for guiding the choice of diagnostic work-up. However, it remains unclear whether artificial intelligence can support complex decision processes that rely on different types of information from imaging modalities such as positron emission tomography (PET) or computed tomography (CT). Therefore, we investigated the accuracy of an advanced LLM in defining disease stages based on diagnostic reports generated for Hodgkin lymphoma patients.

Methods: Our analysis set included 70 consecutive written PET/CT reports of treatment-naïve Hodgkin lymphoma patients, which were slightly modified to remove the physicians' disease classifications. The most probable Ann Arbor stage for each patient was determined in five independent runs using GPT-4 (OpenAI, Inc., San Francisco, CA). To address potential interpretation errors arising from individual report diction, structured summaries of findings were examined as a second step. We then calculated and compared overall and per-stage accuracy for both text formats.
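The overall and per-stage accuracy computation described above can be sketched as follows. This is a minimal illustration with hypothetical stage labels and run structure, not the authors' actual analysis code:

```python
from collections import defaultdict

def accuracy_by_stage(true_stages, predicted_runs):
    """Mean overall accuracy across runs plus per-stage accuracy.

    true_stages: one Ann Arbor stage per patient (e.g. "I".."IV").
    predicted_runs: list of runs; each run is a list of predicted
    stages aligned with true_stages (here, five GPT-4 runs).
    """
    n = len(true_stages)
    overall = []
    stage_hits = defaultdict(int)
    stage_total = defaultdict(int)
    for run in predicted_runs:
        # Overall accuracy for this run
        hits = sum(t == p for t, p in zip(true_stages, run))
        overall.append(hits / n)
        # Accumulate per-stage counts across all runs
        for t, p in zip(true_stages, run):
            stage_total[t] += 1
            stage_hits[t] += (t == p)
    mean_overall = sum(overall) / len(overall)
    per_stage = {s: stage_hits[s] / stage_total[s] for s in stage_total}
    return mean_overall, per_stage
```

With such a helper, the reported figures (e.g. 60.0% overall, 93.3% for stage IV on complete reports) would correspond to `mean_overall` and `per_stage["IV"]` averaged over the five runs.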

Results: The model’s mean overall accuracy for disease extent classification was 60.0% (range, 57.1–64.3%) when entering complete PET/CT reports, with a slight increase to 64.3% (range, 60.0–70.0%; P = .08) upon presentation of structured summaries. While 37.2% of individuals were falsely assigned higher categories based on the standard texts, GPT-4 proposed lower stages in 2.9%. Notably superior mean accuracies of 93.3% (range, 86.7–100%) and 98.7% (range, 93.3–100%) were achieved for stage IV patients when using the complete diagnostic reports and their formatted versions, respectively.

Conclusions: Our study reveals that the accuracy of GPT-4 in Ann Arbor stage assignment based on written PET/CT reports is, so far, insufficient for clinical practice. However, its performance seems to improve slightly when using structured summaries as input. Moreover, furnishing LLMs with context-specific knowledge will presumably further increase their potential in the future.

Authors

Conrad-Amadeus Voltin, Jonathan Kottlors, Peter Borchmann, Philipp Gödel, Alexander Drzezga, Markus Dietlein, Thomas Dratsch